0.10.2 Integer Coding

Section note: this description and source code are provided courtesy of Hugh Williams.

Integer coding is a method for representing a set of integer values more compactly than a fixed-width encoding allows. Two of the most widely used families of integer codes are Elias codes and Golomb codes; both are frequently used in compressing the inverted indexes of English text.

In the Elias gamma code, a positive integer x is represented by $1 + \lfloor \log_{2}x \rfloor$ in unary (that is, $\lfloor \log_{2}x \rfloor$ 0-bits followed by a 1-bit), followed by the binary representation of x without its most significant bit.

Thus the number nine is represented by 0001001, since $1+\lfloor\log_2 9\rfloor=4$, or 0001 in unary, and 9 is 001 in binary with the most significant bit removed. In this way, the integer 1 is represented in a single bit, namely the code 1. Gamma coding is efficient for small integers but is not suited to large integers, for which parameterized Golomb codes or a second Elias code, the delta code, are more suitable.
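
As an illustration, the following sketch encodes a positive integer with the gamma code just described, emitting the code as a string of '0' and '1' characters for readability; the function name elias_gamma and the string-based output are choices made here for clarity and are not part of the accompanying source code.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Append the Elias gamma code of x (x >= 1) to out as ASCII bits. */
static void elias_gamma(unsigned int x, char *out)
{
    int nbits = 0;             /* nbits = 1 + floor(log2 x) */
    unsigned int t = x;
    int i;

    while (t > 0) {
        nbits++;
        t >>= 1;
    }

    /* Unary part: (nbits - 1) zero bits followed by a one bit. */
    for (i = 0; i < nbits - 1; i++)
        strcat(out, "0");
    strcat(out, "1");

    /* Binary part: x without its most significant bit. */
    for (i = nbits - 2; i >= 0; i--)
        strcat(out, (x >> i) & 1 ? "1" : "0");
}

int main(void)
{
    char buf[64] = "";
    elias_gamma(9, buf);
    printf("gamma(9) = %s\n", buf);   /* prints 0001001 */
    return 0;
}
\end{verbatim}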

Elias delta codes are somewhat longer than gamma codes for small integers, but for larger integers, such as ordinal sequence numbers, the situation is reversed. A delta code stores the gamma code representation of $1+\lfloor\log_2 x\rfloor$, the number of bits in the binary representation of the integer x, followed by that binary representation less its most significant bit. However, while Elias codes yield acceptable compression and fast decoding, better performance in both respects is possible with Golomb codes.
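
Assuming the standard Elias delta construction just described, nine would be coded as 00100001: the gamma code of 4, which is 00100, followed by 001. A small sketch, reusing the gamma encoder above, might look as follows; the helper names are again illustrative.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Number of bits in the binary representation of x (x >= 1). */
static int bit_length(unsigned int x)
{
    int n = 0;
    while (x > 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Append the Elias gamma code of x (x >= 1) to out. */
static void elias_gamma(unsigned int x, char *out)
{
    int nbits = bit_length(x);
    int i;
    for (i = 0; i < nbits - 1; i++)
        strcat(out, "0");
    strcat(out, "1");
    for (i = nbits - 2; i >= 0; i--)
        strcat(out, (x >> i) & 1 ? "1" : "0");
}

/* Append the Elias delta code of x: the gamma code of the bit length
 * of x, followed by x without its most significant bit. */
static void elias_delta(unsigned int x, char *out)
{
    int nbits = bit_length(x);
    int i;
    elias_gamma((unsigned int)nbits, out);
    for (i = nbits - 2; i >= 0; i--)
        strcat(out, (x >> i) & 1 ? "1" : "0");
}

int main(void)
{
    char buf[128] = "";
    elias_delta(9, buf);
    printf("delta(9) = %s\n", buf);   /* prints 00100001 */
    return 0;
}
\end{verbatim}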

Golomb codes are a form of parameterized coding in which the integers to be coded are stored as values relative to a constant k. Using Golomb coding, a positive integer x is represented in two parts: the first is a unary representation of the quotient $q=\lfloor(x-1)/k\rfloor+1$; the second is a binary representation of the remainder $r=x-(q-1)k-1$. In this way, the binary part requires $\lfloor\log_2 k\rfloor$ or $\lceil\log_2 k\rceil$ bits.
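
The sketch below encodes a positive integer using the quotient and remainder defined above, coding the remainder in truncated binary so that it occupies either $\lfloor\log_2 k\rfloor$ or $\lceil\log_2 k\rceil$ bits. The unary convention (q-1 zero bits followed by a one bit) mirrors the gamma code description, and the function name golomb_encode is an illustrative choice.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Append the Golomb code of x (x >= 1) with parameter k (k >= 1) to out. */
static void golomb_encode(unsigned int x, unsigned int k, char *out)
{
    unsigned int q = (x - 1) / k + 1;        /* quotient, coded in unary */
    unsigned int r = x - (q - 1) * k - 1;    /* remainder in [0, k-1]    */
    unsigned int b = 0, cutoff;
    int i, nbits;

    while ((1u << b) < k)                    /* b = ceil(log2 k) */
        b++;
    cutoff = (1u << b) - k;

    for (i = 0; i < (int)q - 1; i++)         /* unary part */
        strcat(out, "0");
    strcat(out, "1");

    /* Truncated binary part: small remainders take b-1 bits, the rest b. */
    if (r < cutoff) {
        nbits = (int)b - 1;
    } else {
        r += cutoff;
        nbits = (int)b;
    }
    for (i = nbits - 1; i >= 0; i--)
        strcat(out, (r >> i) & 1 ? "1" : "0");
}

int main(void)
{
    char buf[64] = "";
    golomb_encode(9, 3, buf);
    printf("golomb(9, k=3) = %s\n", buf);    /* prints 00111 */
    return 0;
}
\end{verbatim}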

Witten et al. report that for cases where the probability of any particular value occurring is small, an approximate calculation of k can be used. Where there is a wide range of values to be coded and each occurs with reasonable frequency, a practical global approximation of the Golomb parameter k is

\begin{displaymath}
k \approx 0.69 \times \frac{N \times p}{f}
\end{displaymath}

where N is the number of documents, p is the number of distinct terms in the collection, and f is the count of document identifiers stored in the inverted lists, that is, f is the sum of the lengths of all inverted lists. This model for the selection of k is often referred to as a global Bernoulli model, since each term is assumed to have an independent probability of occurrence and the occurrences of terms then have a geometric distribution.
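
As a sketch of the arithmetic, the small program below evaluates this approximation; the collection figures in main are hypothetical and serve only to show how the estimate is computed.

\begin{verbatim}
#include <stdio.h>

/* Global Bernoulli estimate of the Golomb parameter: k ~= 0.69 * (N * p) / f. */
static unsigned int global_golomb_k(double num_docs, double num_terms,
                                    double num_pointers)
{
    double k = 0.69 * (num_docs * num_terms) / num_pointers;
    return k < 1.0 ? 1 : (unsigned int)(k + 0.5);
}

int main(void)
{
    /* Hypothetical collection: 100,000 documents, 50,000 distinct terms,
     * 10,000,000 stored document identifiers. */
    printf("k = %u\n", global_golomb_k(100000.0, 50000.0, 10000000.0));
    return 0;                                 /* prints k = 345 */
}
\end{verbatim}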

Another approach to selecting k is to use a local model. Local models use the information stored within a list to calculate an appropriate value of k for that list; they result in better compression than global models, but require a parameter for each locality. For example, using a simple local Bernoulli model for storing sequence identifiers, a possible choice of k for a given list is an approximation of the mean difference between the document identifiers in that list or, using the scheme above,

\begin{displaymath}
k \approx 0.69 \times \frac{N}{l}
\end{displaymath}

where l is the length of a given list, that is, the count of entries in the list.
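
For instance, in a hypothetical collection of N = 100,000 documents, a list of length l = 1,000 would give $k \approx 0.69 \times 100000/1000 = 69$, whereas a shorter list of length 100 would give $k \approx 690$; longer lists, whose gaps between document identifiers are on average smaller, thus receive smaller parameters and correspondingly shorter codes for small gaps.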

Skewed Bernoulli models, where a simple mean difference is not used, typically result in better compression than simple local models.

With integers of varying magnitudes, as is the case with document occurrence counts and inverted-file offsets that range from 1 to the database size, efficient storage is possible using a variable-byte integer scheme. We use a representation in which seven bits in each byte code the integer, with the least significant bit set to 0 if this is the last byte, or to 1 if further bytes follow. In this way, we represent small integers efficiently; for example, we represent 135 in two bytes, since it lies in the range $[2^7 \cdots 2^{14})$, as 00000011 00001110; this is read as 00000010000111 by removing the least significant bit from each byte and concatenating the remaining 14 bits.
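
The following sketch implements the variable-byte scheme described above, with the continuation flag in the least significant bit of each byte; the function name vbyte_encode and the fixed-size buffer are illustrative choices.

\begin{verbatim}
#include <stdio.h>

/* Write x into buf, most significant seven-bit group first, with the
 * least significant bit of each byte set to 1 if further bytes follow
 * and 0 on the final byte.  Returns the number of bytes used. */
static int vbyte_encode(unsigned int x, unsigned char *buf)
{
    unsigned char groups[5];   /* enough for a 32-bit integer */
    int n = 0, i;

    do {
        groups[n++] = (unsigned char)(x & 0x7f);   /* low seven bits */
        x >>= 7;
    } while (x > 0);

    /* Emit groups most significant first; all but the last byte carry a
     * continuation bit of 1 in the least significant position. */
    for (i = n - 1; i >= 0; i--)
        buf[n - 1 - i] = (unsigned char)((groups[i] << 1) | (i > 0 ? 1 : 0));
    return n;
}

int main(void)
{
    unsigned char buf[5];
    int n = vbyte_encode(135, buf), i;

    printf("135 ->");
    for (i = 0; i < n; i++)
        printf(" %02x", buf[i]);                   /* prints 03 0e */
    printf("\n");
    return 0;
}
\end{verbatim}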



 