it.unimi.dsi.mg4j.util |
MG4J: Managing Gigabytes for Java
General-purpose utility classes.
Package Specification
The classes in this package are completely general utility classes that
may be useful outside of MG4J. In particular, {@link
it.unimi.dsi.mg4j.util.MutableString} is a highly-tuned, versatile class
for content-based mutable strings.
Another general-purpose important component is {@link
it.unimi.dsi.mg4j.util.MinimalPerfectHash}, which implements
state-of-the-art, hypergraph-based algorithms for the construction of
ordered minimal perfect hash tables.
If minimal perfect hash tables are too large, a good alternative is
an {@link it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary}, which keeps most data in compressed
form on disk, and uses a small in-memory index to access those data.
|
Java Source File Name | Type | Comment |
BitVectorBooleanIterator.java | Class | A boolean iterator returning the bits of a bit vector. |
BloomFilter.java | Class | A Bloom filter.
Instances of this class represent a set of character sequences or primitive-type
arrays (with false positives) using a Bloom filter. |
CircularCharArrayBuffer.java | Class | A circular char buffer, that can be used to implement a sliding
window over a text. |
Fast.java | Class | All-purpose optimised static-method container class. |
FlyweightPrototype.java | Interface | A prototype providing flyweight copies.
MG4J uses often flyweight copies to implement multithreading on read-only
(but maybe stateful) classes. |
FlyweightPrototypes.java | Class | A class providing static methods and objects that do useful things
with
. |
FrontCodedStringList.java | Class | Compact storage of strings using front-coding compression.
This class stores a list of strings using front-coding compression (of course,
the compression will be reasonable only if the list is sorted, but you could
also use instances of this class just as a handy way to manage a large
amount of strings). |
HashCodeSignedMinimalPerfectHash.java | Class | java.lang.String.hashCode -signed order-preserving minimal perfect hash tables. |
ImmutableBinaryTrie.java | Class | An immutable implementation of binary tries.
Instance of this class are built starting from a lexicographically ordered
list of
cern.colt.bitvector.BitVector s representing binary words. |
ImmutableExternalPrefixDictionary.java | Class | An immutable prefix dictionary mostly stored in external memory.
A prefix dictionary is a data structure that stores efficiently a lexicographically ordered list
of character sequences so that it is possible to obtain, given a character sequence s,
its position in the sequence, and also the (possibly empty) interval of the sequence composed by
character sequences that are extensions of s. |
ImmutableExternalTreePrefixDictionary.java | Class | An
it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary that compresses words using
a
it.unimi.dsi.mg4j.compression.HuffmanCodec and approximates
intervals using a
it.unimi.dsi.mg4j.util.TernaryIntervalSearchTree . |
ImmutableExternalTriePrefixDictionary.java | Class | An
it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary that compresses words using
a
it.unimi.dsi.mg4j.compression.HuTuckerCodec and approximates
intervals using an
it.unimi.dsi.mg4j.util.ImmutableTriePrefixTree that uses the same codec. |
ImmutableTriePrefixTree.java | Class | A class adapter from immutable binary tries to prefix trees.
Instances of this class map transparently strings into a binary trie. |
IntBloomFilter.java | Class | A Bloom filter for integers.
Instances of this class represent a set of integers (with false positives)
using a Bloom filter. |
LiterallySignedMinimalPerfectHash.java | Class | Literally signed order-preserving minimal perfect hash tables.
This class source exemplifies a signed minimal perfect hash table that
signes each term with the term itself, thus avoiding false positives. |
MG4JClassParser.java | Class | A small wrapper around JSAP's standard
ClassStringParser . |
MimeTypeResolver.java | Class | A thin wrapper around a singleton instance of
javax.activation.MimetypesFileTypeMap that tries to load /etc/mime.types into the map. |
MinimalPerfectHash.java | Class | Order-preserving minimal perfect hash tables.
Given a list of terms without duplicates,
the constructors of this class finds an order-preserving minimal perfect hash function for
the list. |
MutableString.java | Class | Fast, compact, optimised & versatile mutable strings.
Motivation
The classical Java string classes,
java.lang.String and
java.lang.StringBuffer , lie at the extreme of a spectrum (immutable and
mutable).
However, large-scale text indexing requires some features that are not
provided by these classes: in particular, the possibility of using a mutable
string, once frozen, in the same optimised way of an immutable
string. |
MutableStrings.java | Class | A static container providing utility objects and method for
it.unimi.dsi.mg4j.util.MutableString s. |
PermutedFrontCodedStringList.java | Class | A
it.unimi.dsi.mg4j.util.FrontCodedStringList whose indices are permuted.
It may happen that a list of strings compresses very well
using front coding, but unfortunately alphabetical order is not
the right order for the strings in the list. |
ProgressLogger.java | Class | Tunable progress logger.
This class provides a simple way to log progress information about long-lasting activities.
To use this class, you first create a new instance by passing a
, a
org.apache.log4j.Priority and a time interval in millisecond (you can use constants such as
ProgressLogger.ONE_MINUTE ).
Information will be logged about the current state of affairs no more often than the given
time interval. |
ProgressMeter.java | Class | Tunable progress meter.
This class provides a simple way to display progress information about long-lasting activities. |
Properties.java | Class | An extension of
org.apache.commons.configuration.PropertiesConfiguration providing setters for primitive types, a simpler
and transparently handling of
java.lang.Enum lowercased keys.
All accessors defined in
org.apache.commons.configuration.PropertiesConfiguration have a
polymorphic counterpart taking an
java.lang.Enum instead of a string:
java.lang.Enum.name and
java.lang.String.toLowerCase are applied before
delegating to the corresponding string-based method. |
SemiExternalOffsetList.java | Class | Provides semi-external random access to offsets of an
it.unimi.dsi.mg4j.index.Index index . |
ShiftAddXorLongSignedMinimalPerfectHash.java | Class | Order-preserving minimal perfect hash tables signed using Shift-Add-Xor long hashes. |
ShiftAddXorSignedMinimalPerfectHash.java | Class | Order-preserving minimal perfect hash tables signed using Shift-Add-Xor hashes. |
SignedMinimalPerfectHash.java | Class | Signed order-preserving minimal perfect hash tables.
Minimal perfect hash tables will always return a result, even for terms that were
not present in the collection indexed by the table. |
TernaryIntervalSearchTree.java | Class | Ternary interval search trees.
Ternary search trees are a data structure used to store words over an alphabet; they are
a useful alternatives to tries when the alphabet is large.
Ternary interval search trees have the additional properties of being able
to locate quickly intervals of words extending a given prefix (where “quickly” means
that no more successful character comparisons than the prefix length are performed). |
TextPattern.java | Class | QuickSearch matching against a constant string.
The
of the Java API
are a powerful tool; however, when searching for a constant pattern
many algorithms can increase of orders magnitude the speed of a search.
This class provides constant-pattern text search facilities by
implementing Sunday's QuickSearch (a simplified but very effective variant
of the Boyer—Moore search algorithm) using compact
approximators, a randomised data structure that can accomodate in a
small space (but in an approximated way) the bad-character shift table of a
large alphabet such as Unicode.
Since a large subset of US-ASCII is used in all languages (e.g.,
whitespace, punctuation, etc.), this class caches separately the shifts for
the first 128 Unicode characters, resulting in very good performance even on
text in pure US-ASCII.
Note that the
MutableString.indexOf(MutableString) indexOf methods of
MutableString use a even more simplified variant of
QuickSearch which is less efficient, but has a smaller setup time and does
not generate any object. |