| Computes the Clarke–Cormack score of all interval iterators of a document.
This score function is defined in Charles L.A. Clarke and Gordon V. Cormack, “Shortest-Substring
Retrieval and Ranking”, ACM Transactions on Information Systems, 18(1):44–78, 2000,
at page 65.
The score for each index depends on two parameters: an integer h and a double α.
The score is obtained by summing the scores assigned to the individual intervals of the interval iterator
under examination. The score assigned to an interval is 1 if the interval
has length smaller than h; otherwise, it is obtained by dividing h by
the interval length, and raising the result to the power of α.
Note that the score assigned to each interval is between 0 and 1 (higher scores corresponding
to better intervals). The score assigned to an interval iterator is thus bounded from above by the
number of intervals; an alternative version yields normalized scores (in this case, the resulting
value is an average instead of a sum). A scorer with similar relative ranks, but inherently (almost) normalized,
is provided by
it.unimi.dsi.mg4j.search.score.VignaScorer.
Typically, one sets h=16 (or a bit larger) and α=1 (or a bit smaller),
but the authors say that the method is rather stable with respect to changes in the parameter values.
|
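
The scoring rule described above can be illustrated with a minimal sketch. It is not the MG4J implementation: the class `ClarkeCormackSketch` and its methods are hypothetical names, and the sketch assumes that the interval lengths have already been extracted from the interval iterator as positive integers. The `normalize` flag stands in for the alternative version that returns an average instead of a sum.

```java
/** Hypothetical illustration of the scoring rule; not part of the MG4J API. */
final class ClarkeCormackSketch {
    private ClarkeCormackSketch() {}

    /**
     * Score of a single interval: 1 if the interval is shorter than h,
     * otherwise (h / length) raised to the power alpha.
     * Assumes length and h are positive.
     */
    static double intervalScore(int length, int h, double alpha) {
        if (length < h) return 1.0;
        return Math.pow((double) h / length, alpha);
    }

    /**
     * Score of one interval iterator, given the lengths of its intervals.
     * With normalize == false the result is the sum of the interval scores
     * (bounded above by the number of intervals); with normalize == true it
     * is their average, which lies between 0 and 1.
     */
    static double score(int[] lengths, int h, double alpha, boolean normalize) {
        double sum = 0.0;
        for (int length : lengths) {
            sum += intervalScore(length, h, alpha);
        }
        // An empty iterator contributes a score of 0 in either mode.
        if (normalize && lengths.length > 0) return sum / lengths.length;
        return sum;
    }
}
```

For instance, with h=16 and α=1, intervals of length 4, 16 and 32 score 1, 1 and 0.5 respectively, so the unnormalized score is 2.5.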