Length-Dinky Prefix Codes

logo

posted by Stephan Brumme

Overview

Essentially the most regular methodology to generate prefix-free codes is the Huffman algorithm.


It is allotment of many standard compression algorithms reminiscent of JPEG,
DEFLATE (ZIP /
GZIP / PNG / etc.).


Typically it serves as an “afterburner” to compress pre-processed symbols.

The principle advantage – variable-dimension codes – is known as a excessive downside, too: if the utmost code dimension exceeds a definite threshold,
e.g. the CPU’s integer dimension, (16 / 32 / 64 bit), then the compression/decompression code becomes advanced and slows down.


All of the aforementioned file formats place a strict higher restrict on the utmost dimension of Huffman codes.


Mainly on account of their age and low quantity of distinct symbols, all of them agreed on 16 bits per code.

So-known as dimension-dinky prefix codes will not be Huffman codes anymore because they place not appear to be optimal (if the dimensions restriction develop into as soon as enforced).


The loss in compression efficiency turns out to be in most cases negligible and thus is widely authorized on account of the quite a bit of own in decompression performance.

There may perhaps be somewhat a pair of algorithms to present such dimension-dinky prefix codes. They vary referring to walk, efficiency and complexity.


I wrote a C library imposing the commonest methods. Many of the code derives from open source programs and develop into as soon as invented by others.

My vital contribution develop into as soon as to present a shared interface in remark that these algorithms are simply interchangeable:



veil

interface


unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);

There are three enter parameters:

  • maxLength is the upper restrict of the quantity of bits for every and every prefix code
  • numCodes is the quantity of distinct symbols (including unused symbols) → dimension of your alphabet
  • histogram is an array of numCodes counters how in most cases each and every image is portray

and one output parameter:

  • codeLengths will have the bit dimension of every and every image

The return payment is:

  • the longest bit dimension or zero if the algorithm failed

If you happen to like to wish to encode the sequence 0 3 1 1 2 5 2 2 then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 };

Nonetheless, this shared interface comes with a diminutive bit of overhead: each and every so again and again doubling code dimension and/or execution time.


Attributable to this reality most files bear a 2nd public characteristic which is more specialized for its algorithm but may perhaps perchance perhaps also simply bear a pair of further restrictions.


An everyday case is that the histogram must be sorted and unused symbols removed.

For the time being utilized are:

Utilization

Every algorithm runs self sustaining of every and every other. There don’t appear to be any exterior dependencies, too.


Some algorithms require pre-computed Huffman codes – for these cases I added a rather modified version of
Andrew Moffat’s atmosphere friendly in-establish of dwelling Huffman code.


If you happen to like to wish to confirm out out an algorithm then all you want present is:



veil

Utilization


#consist of "packagemerge.h"
#consist of

int vital()
{
// let's own you may perhaps perchance perhaps even bear 7 symbols
unsigned int numCodes = 7;
// first one happens 270 cases, 2nd 20x, third 10x, fourth under no circumstances, fifth as soon as, sixth 6x and seventh as soon as
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };

// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];

// compute dimension-dinky codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);

// pronounce end result
for (unsigned int i = 0; i < numCodes; i++)
printf("code %d occurs %d times => %d bitsn", i, histogram[i], codeLengths[i]);
printf("max. code dimension is %dn", longest);

return 0;
}

When in comparison with the customary Huffman algorithm, a pair of code grew to change into shorter and a pair of grew to change into longer.


The amount of bits / codeLengths are:

Symbol 0 1 2 3 4 5 6
Depend 270 20 10 0 1 6 1 sum = 308
Huffman 1 2 3 0 5 4 5 sum = 374 bits
maxLength = 4 1 2 4 0 4 4 4 sum = 382 bits
maxLength = 3 2 2 3 0 3 3 3 sum = 634 bits

As you may perhaps perchance perhaps look, limiting the utmost code lengths hurts the compression ratio.

There are quite a bit of methods to rework code lengths to proper prefix-free codes.
I derive canonical Huffman codes to be the most upright: a straightforward and hasty algorithm.


Utilizing the instance files we score these canonical codes:

Symbol 0 1 2 3 4 5 6
Depend 270 20 10 0 1 6 1
Huffman 0 10 110 11110 1110 11111
maxLength = 4 0 10 1100 1101 1110 1111
maxLength = 3 00 01 100 101 110 111

An everyday program exhibiting the style to compute canonical codes (click on to open):



portray

canonical.c


#consist of "packagemerge.h"
#consist of


// produce canonical codes, for simplicity longest code must never exceed quantity of bits of an unsigned int (32 or 64)
void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[])
{
// count how in most cases each and every code dimension is portray
unsigned int numLengths[256] = { 0 };
unsigned int i;
for (i = 0; i < numCodes; i++)
numLengths[codeLengths[i]]++;

// precompute the most most considerable code of every and every code dimension
unsigned int subsequent[256];
// there isn't any code with dimension 0 and the most most considerable 1-bit code with be a zero
subsequent[1] = 0;
for (i = 2; i < 256; i++)
subsequent[i] = (subsequent[i - 1] + numLengths[i - 1]) << 1;

// now attach all codes
for (i = 0; i < numCodes; i++)
codes[i] = subsequent[codeLengths[i]]++;
}


int vital()
{
// let's own you may perhaps perchance perhaps even bear 7 symbols
unsigned int numCodes = 7;
// first one happens 270 cases, 2nd 20x, third 10x, fourth no at all, fifth as soon as, sixth 6x and seventh as soon as
unsigned int histogram[7] = { 270,20,10,0,1,6,1 };

// at most 4 bits
unsigned char maxLength = 4;
// output: bits per code
unsigned char codeLengths[7];

// compute dimension-dinky codes
unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths);

// produce canonical codes
unsigned int codes[7];
canonical(numCodes, codeLengths, codes);

// pronounce end result
unsigned int i;
for (i = 0; i < numCodes; i++)
{
printf("code %d occurs %d timest=> %d bits", i, histogram[i], codeLengths[i]);

// binary represenation of canonical code
printf("t=> ");
if (codeLengths[i] > 0)
{
int shift;
for (shift = codeLengths[i] - 1; shift >= 0; shift--)
printf(codes[i] & (1 << shift) ? "1" : "0");
}
printf("n");
}
printf("max. code length is %dn", longest);

return 0;
}

Benchmark

Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

time ./benchmark 1 12 100000

  • where 1 is the algorithm’s ID and was between 0 and 6
  • limiting the code length to at most 12 bits
    • note: the unadjusted Huffman codes have up to 16 bits
  • each algorithm ran 100000 times

For comparison: the uncompressed data has 64k bytes = 524,288 bits:

algorithm ID total size compression ratio execution time
Huffman 0 326,892 bits 62.35% 0.54 s
Package-Merge 1 327,721 bits 62.51% 1.17 s
MiniZ 2 328,456 bits 62.65% 0.58 s
JPEG 3 328,456 bits 62.65% 0.61 s
BZip2 4 328,887 bits 62.73% 0.88 s
Kraft 5 327,895 bits 62.54% 1.72 s
modified Kraft 6 327,942 bits 62.55% 0.41 s

Remember: each algorithm was repeated 100,000 times, so a single iteration finished in about 4 to 17 microseconds.

Changing the code length limit significantly influences both execution time and compression ratio.


I picked the two “best” algorithms, used the same data set and repeated each procedure 100,000 times:

length limit Package-Merge Kraft Strategy B
8 bits 70.47%, 0.96 s 70.76%, 0.24 s
9 bits 65.30%, 1.02 s 65.31%, 0.24 s
10 bits 63.49%, 1.07 s 63.79%, 0.31 s
11 bits 62.80%, 1.14 s 62.84%, 0.37 s
12 bits 62.51%, 1.17 s 62.55%, 0.40 s
13 bits 62.40%, 1.22 s 62.43%, 0.34 s
14 bits 62.36%, 1.25 s 62.42%, 0.40 s
15 bits 62.35%, 1.29 s 62.42%, 0.66 s
16 bits 62.35%, 1.35 s 62.42%, 0.70 s

For comparison: Moffat’s Huffman algorithm needs 0.55 seconds and its longest prefix code has 16 bits.

Package-Merge

Package-Merge is an algorithm that guarantees to produce optimal length-limited prefix codes.


Its name is based on the conception of putting numbers in packages and merging these.

Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn’t very accessible.


Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn’t help a lot.


The best introduction – in my opinion – is Sebastian Gesemann’s Bachelor’s Thesis but it’s available in German only.


The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.

There are two prerequisites:

  • remove all unused symbols before running Package-Merge
  • and symbols must must be sorted in ascending order of their frequency

Let’s process the sample data shown above:

Symbol 0 1 2 3 4 5 6
Count 270 20 10 0 1 6 1

Obviously, the prerequisites aren’t fulfilled yet: unused symbol 3 needs to be removed and the remaining symbols must be in ascending order of their count:


(the order of symbols with the same count doesn’t matter)

Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270

Since there were 6 symbols (plus an unused one), we need at least 3 bits to encode them.


Naturally you can’t have an active symbol with zero bits, the minimum code length is 1 bit.


Package-Merge adds one bit in each iteration so there is a minimum of two iterations (2+1=3) but for whatever reason I decided to have this example with maxLength=4.

Now we add these counts pairwise, we build packages. If there is an odd number of symbols then the last (which is the largest) is discarded:

Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Pairwise Sum 2 16 290

These packages have to be merged with the original sequence in ascending order.


If a package is identical to a count of the original sequence then that item from the original sequence comes first.


In the next table packages are shown in a bold font:

Count 1 1 2 6 10 16 20 270 290

This was one iteration of phase 1. Each iteration represents one bit.


Therefore we need to repeat that procedure two more times to reach our goal maxLength=4.


However, there is one little pitfall: pairs are made using the sequence from the previous step but you always merge them with the original sequence.


That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.

Okay, let’s create package for the second bit:

Count 1 1 2 6 10 16 20 270 290
Pairwise Sum 2 8 26 290

As you can see, there is an odd number of items and thus the last item (290 in the upper row) must be discarded.


The four values in the lower row are merged with the original sequence:

Count 1 1 2 6 8 10 20 26 270 290

One more time “packaging”:

Count 1 1 2 6 8 10 20 26 270 290
Pairwise Sum 2 8 18 46 560

And merging:

Count 1 1 2 6 8 10 18 20 46 270 560

Phase 1 has completed. Let’s gather all intermediate results from each iteration in one table:

Original 1 1 6 10 20 270
Iteration 1 1 1 2 6 10 16 20 270 290
Iteration 2 1 1 2 6 8 10 20 26 270 290
Iteration 3 1 1 2 6 8 10 18 20 46 270 560

Phase 2 processes the output of each iteration, going from the bottom row to the top row.


The actual values in the table don’t matter – it’s only relevant whether a value is a package (bold) or not (plain).

Code lengths for all symbols are initially set to zero.


If a value isn’t bold then the code length of a symbol has to be incremented.

The first iteration processes the first 2*numCodes-2=10 values of the bottom row, that means 1,1,...,270 but not 560.


Two counters symbol and numMerged are set to zero.

For each table entry a decision has to be made:

  • if that value is a package (bold) then increment numMerged
  • otherwise increment codeLengths[symbol] and then symbol

The next iteration has to look at the first 2*numMerged table entries.

Let’s go step-by-step through the example – as mentioned before we start with the bottom row.


The code length of all symbols was set to zero.

(array index) 0 1 2 3 4 5
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Code Length (initially) 0 0 0 0 0 0
Code Length

(after processing bottom row)
1 1 1 1 1 1

What happened ? We looked at ten values of iteration 3 of the first phase:

  1. a 1 which wasn’t bold → codeLength[0] = 1 and then symbol = 1
  2. another 1 which wasn’t bold → codeLength[1] = 1 and then symbol = 2
  3. a 2 which is boldnumMerged = 1
  4. a 6 which wasn’t bold → codeLength[2] = 1 and then symbol = 3
  5. a 8 which is boldnumMerged = 2
  6. a 10 which wasn’t bold → codeLength[3] = 1 and then symbol = 4
  7. a 18 which is boldnumMerged = 3
  8. a 20 which wasn’t bold → codeLength[4] = 1 and then symbol = 5
  9. a 46 which is boldnumMerged = 4
  10. a 270 which wasn’t bold → codeLength[5] = 1 and then symbol = 6

Somehow that wasn’t very exciting … all code lengths changed from zero to one.


Because the final value of numMerged is 4, we look at the first 4*2=8 values of the second-to-last row.


Its content was (just a copy, I stroke through irrelevant entries):

Iteration 2 1 1 2 6 8 10 20 26 270 290

This time we cope only with three bold entries and five non-bold.


That means numMerged = 3 and the code length of all-but-the-last symbol becomes 2.


That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.

Again, a copy of phase 1’s iteration 1 (where only 2*numMerged = 6 will be handled):

Iteration 1 1 1 2 6 10 16 20 270 290

There are two bold entries (merged packages) and four non-bold.


You probably already understand that the first four symbol’s code length needs to be incremented and


we will look at the first 2*numMerged = 4 numbers of the final row (the top row).

Obviously there can’t be any bold values.


The first four symbol’s code length is incremented again. Let’s take a look at the code lengths of each step:

(array index) 0 1 2 3 4 5
Symbol 4 6 5 2 1 0
Count 1 1 6 10 20 270
Code Length (initially) 0 0 0 0 0 0
Code Length

(after processing bottom row /

iteration 3)
1 1 1 1 1 1
Code Length

(after processing iteration 2)
2 2 2 2 2 1
Code Length

(after processing iteration 1)
3 3 3 3 2 1
Code Length

(after processing top row)
4 4 4 4 2 1

That’s it ! The bottom code contains the final code lengths.


No symbol’s number of bits is higher than maxLength=4 and the Kraft-McMillan inequality holds true as well: 2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1.


The algorithm is guaranteed to create optimal code lengths for a prefix-free code. No other combination of code lengths will be better.

I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.


Take a look at my code (click below) or read the scientific papers to learn more details.



show

packagemerge.c


// //////////////////////////////////////////////////////////
// packagemerge.c
// written by Stephan Brumme, 2021
// see https://create.stephan-brumme.com/length-limited-prefix-codes/
//

#include "packagemerge.h"
#include // malloc/free/qsort


// ----- package-merge algorithm -----

// to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004)

// data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31)
typedef unsigned prolonged prolonged BitMask;
typedef unsigned prolonged prolonged HistItem;

/// compute dinky prefix code lengths according to Larmore/Hirschberg's equipment-merge algorithm
/- histogram must be in ascending pronounce and no entry must be zero
- the characteristic rejects maxLength > 63 but I don't look any gleaming reasons you'd desire a elevated restrict ...
@param maxLength maximum code dimension, e.g. 15 for DEFLATE or JPEG
@param numCodes quantity of codes, equals the array dimension of histogram and codeLength
@param A [in] how in most cases each and every code/image develop into as soon as found out [out] computed code lengths
@end result proper maximum code dimension, 0 if error
*/
unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[])
{
// skip zeros
while (numCodes > 0 && A[0] == 0)
{
numCodes--;
A++;
}

// at the least one code desires to be in employ
if (numCodes == 0 || maxLength == 0)
return 0;

// one or two codes are repeatedly encoded with a single bit
if (numCodes <= 2)
{
A[0] = 1;
if (numCodes == 2)
A[1] = 1;
return 1;
}

// A[] is an input parameter (stores the histogram) as well as
// an output parameter (stores the code lengths)
const unsigned inthistogram = A;
unsigned intcodeLengths = A;

// my allround variable for various loops
unsigned int i;

// check maximum bit length
if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63
return 0;

// at the least log2(numCodes) bits required for every and every legit prefix code
unsigned prolonged prolonged encodingLimit = 1ULL << maxLength;
if (encodingLimit < numCodes)
return 0;

// need two buffers to direction of iterations and an array of bitmasks
unsigned int maxBuffer = 2 numCodes;
// allocate memory
HistItempresent = (HistItem*) malloc(sizeof(HistItem) maxBuffer);
HistItemprevious = (HistItem*) malloc(sizeof(HistItem) maxBuffer);
BitMaskisMerged = (BitMask*) malloc(sizeof(BitMask) maxBuffer);

// initial payment of "previous" is a straightforward replica of the sorted histogram
for (i = 0; i < numCodes; i++)
previous[i] = histogram[i];
unsigned int numPrevious = numCodes;
// no must initialize "present", it is entirely rebuild each and every iteration

// withhold word which applications are merged (compact bitmasks):
// if equipment p develop into as soon as merged in iteration i then (isMerged[p] & (1 << i)) != 0
for (i = 0; i < maxBuffer; i++)
isMerged[i] = 0; // there are no merges before the first iteration

// the last 2 packages are irrelevant
unsigned int numRelevant = 2 numCodes - 2;

// ... and preparation is finished

// //////////////////////////////////////////////////////////////////////
// iterate through potential bit lengths while packaging and merging pairs
// (step 1 of the algorithm)
// - the histogram is sorted (prerequisite of the function)
// - the output must be sorted, too
// - thus we have to copy the histogram and every and then insert a new package
// - the code keeps track of the next package and compares it to
// the next item to be copied from the history
// - the smaller value is chosen (if equal, the histogram is chosen)
// - a bitmask named isMerged is used to keep track which items were packages
// - repeat until the whole histogram was copied and all packages inserted

// bitmask for isMerged
BitMask mask = 1;
unsigned char bits;
for (bits = maxLength - 1; bits > 0; bits--)
{
// ignore closing ingredient if numPrevious is atypical (can't be paired)
numPrevious &= ~1; // bit-twiddling trick to clear the bottom bit, same as numPrevious -= numPrevious % 2

// first merged equipment
present[0] = histogram[0]; // a sum can't be smaller than its system
present[1] = histogram[1]; // therefore it is very unlikely to search out a equipment at index 0 or 1
HistItem sum = present[0] + present[1]; // same as previous[0] + previous[1]

// replica histogram and insert merged sums whenever that you may perhaps perchance perhaps presumably be also enlighten of
unsigned int numCurrent = 2; // present[0] and present[1] had been already establish of dwelling
unsigned int numHist = numCurrent; // we took them from the histogram
unsigned int numMerged = 0; // but thus a long way no equipment inserted (on the opposite hand, it is precomputed in "sum")
for (;;) // end/damage is within the loop
{
// the subsequent equipment isn't higher than the subsequent histogram merchandise ?
if (numHist < numCodes && histogram[numHist] <= sum)
{
// copy histogram item
current[numCurrent++] = histogram[numHist++];
continue;
}

// okay, we have a package being smaller than next histogram item

// mark output value as being "merged", i.e. a package
isMerged[numCurrent] |= mask;

// store package
current [numCurrent] = sum;
numCurrent++;

// already finished last package ?
numMerged++;
if (numMerged 2 >= numPrevious)
damage;

// precompute subsequent sum
sum = previous[numMerged 2] + previous[numMerged 2 + 1];
}

// make sure each and every code from the histogram is included
// (relevant if histogram may perhaps be very skewed with a pair of outliers)
while (numHist < numCodes)
present[numCurrent++] = histogram[numHist++];

// put together subsequent conceal
conceal <<= 1;

// performance tweak: abort as soon as "previous" and "current" are identical
if (numPrevious >= numRelevant) // ... at the least their relevant system
{
// fundamentally a bool: FALSE == 0, TRUE == 1
char keepGoing = 0;

// review each and every arrays: if they're same then end
for (i = numRelevant - 1; i > 0; i--) // collisions are presumably at the pause
if (previous[i] != present[i])
{
keepGoing++;
damage;
}

// early exit ?
if (keepGoing == 0)
damage;
}

// swap pointers "previous" and "present"
HistItemtmp = previous;
previous = present;
present = tmp;

// no must swap their sizes because finest numCurrent wanted in subsequent iteration
numPrevious = numCurrent;
}

// shifted one bit too a long way
conceal >>= 1;

// withhold finest isMerged
free(previous);
free(present);

// //////////////////////////////////////////////////////////////////////
// tracking all merges will create the code lengths
// (step 2 of the algorithm)
// - analyze each and every bitlength's conceal in isMerged:
// a "pure" image => amplify bitlength of that image
// a merged code => upright amplify counter
// - end if no more merged codes found out
// - if m merged codes had been found out then finest word
// the most most considerable 2*m system within the subsequent iteration
// (because finest they fashioned these merged codes)

// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;

// initiate with examining the most most considerable 2n-2 values
unsigned int numAnalyze = numRelevant;
while (conceal != 0) // stops if nothing but symbols are found out in an iteration
{
// quantity of merged applications seen thus a long way
unsigned int numMerged = 0;

// the most most considerable two system must be symbols, they may be able to't be applications
codeLengths[0]++;
codeLengths[1]++;
unsigned int image = 2;

// glimpse at applications
for (i = image; i < numAnalyze; i++)
{
// check bitmask: not merged if bit is 0
if ((isMerged[i] & mask) == 0)
{
// we have a single non-merged symbol, which needs to be one bit longer
codeLengths[symbol]++;
symbol++;
}
else
{
// we have a merged package, so that its parts need to be checked next iteration
numMerged++;
}
}

// look only at those values responsible for merged packages
numAnalyze = 2 numMerged;

// note that the mask was originally slowly shifted left by the merging loop
mask >>= 1;
}

// closing iteration can't bear any merges
for (i = 0; i < numAnalyze; i++)
codeLengths[i]++;

// it's a free world ...
free(isMerged);

// first symbol has the longest code because it's the least frequent in the sorted histogram
return codeLengths[0];
}


// the following code is almost identical to function moffat() in moffat.c


// helper struct for qsort()
struct KeyValue
{
unsigned int key;
unsigned int value;
};
// helper function for qsort()
static int compareKeyValue(const voida, const voidb)
{
struct KeyValueaa = (struct KeyValue*) a;
struct KeyValuebb = (struct KeyValue*) b;
// negative if a < b, zero if a == b, positive if a > b
if (aa->key < bb->key)
return -1;
if (aa->key > bb->key)
return +1;
return 0;
}


/// same as sooner than but histogram might be in any pronounce and may perhaps perchance perhaps simply have zeros, the output is kept in a dedicated parameter
/- the characteristic rejects maxLength > 63 but I don't look any gleaming reasons you'd desire a elevated restrict ...
@param maxLength maximum code dimension, e.g. 15 for DEFLATE or JPEG
@param numCodes quantity of codes, equals the array dimension of histogram and codeLength
@param histogram how in most cases each and every code/image develop into as soon as found out
@param codeLength [out] computed code lengths
@end result proper maximum code dimension, 0 if error
*/
unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[])
{
// my allround variable for assorted loops
unsigned int i;

// reset code lengths
for (i = 0; i < numCodes; i++)
codeLengths[i] = 0;

// count non-zero histogram values
unsigned int numNonZero = 0;
for (i = 0; i < numCodes; i++)
if (histogram[i] != 0)
numNonZero++;

// reject an empty alphabet because malloc(0) is undefined
if (numNonZero == 0)
return 0;

// allocate a buffer for sorting the histogram
struct KeyValuemapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) numNonZero);
// replica histogram to that buffer
unsigned int storeAt = 0;
for (i = 0; i < numCodes; i++)
{
// skip zeros
if (histogram[i] == 0)
continue;

mapping[storeAt].key = histogram[i];
mapping[storeAt].payment = i;
storeAt++;
}
// now storeAt == numNonZero

// invoke C regular library's qsort
qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue);

// extract ascendingly ordered histogram
unsigned intsorted = (unsigned int*) malloc(sizeof(unsigned int) numNonZero);
for (i = 0; i < numNonZero; i++)
sorted[i] = mapping[i].key;

// sprint equipment-merge algorithm
unsigned char end result = packageMergeSortedInPlace(maxLength, numNonZero, sorted);

// "unsort" code lengths
for (i = 0; i < numNonZero; i++)
codeLengths[mapping[i].value] = sorted[i];

// let it scoot ...
free(sorted);
free(mapping);

return end result;
}

Kraft-McMillan Inequality

In a single among the closing sentences I talked about the Kraft-McMillan inequality (each and every so again and again McMillan is forgotten).


It is core message is: if sum(2^-codeLength[0...numCodes]) then a prefix-free code exists.

Ideally you are trying to bear sum(2^-codeLength[0...numCodes]) = 1 ("equal" in establish of dwelling of "much less-than or "equal") because the rest is inefficient.


The Huffman algorithm repeatedly presents you sum = 1. The Kit-Merge algorithm has the same property.

The following dimension-limiting methods alter code lengths within the kind of capacity that the utmost code dimension is enforced ("dinky") while the inequality calm holds upright.

Clutch that you ran the Huffman algorithm and situated out that a pair of codes are too prolonged. You know that sum = 1.


A truly powerful insight is that making one among such prolonged codes finest one bit shorter increments the sum, thus violating the inequality:

newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^(oldLength-1) = sum + 2^-oldLength

Nonetheless, if there is any other image with a code dimension that is shorter than your required maxLength we can invent it one bit longer to offset our replace.


The formula is virtually the same (but plus one in establish of dwelling of minus one within the exponent).


Somewhat mighty all algorithm talked about within the follwing chapters pair prolonged with short codes such that the fresh dimension restrict is respected and the inequality holds upright.

JPEG's Length-Limiting Intention

This is the shortest algorithm. It might probably be found out in Annex Okay.3 of the JPEG regular ITU T.81 alongside with a nice drift plan.


You might perhaps perhaps additionally score that doc from w3.org, look web remark 147 of that PDF.

First you produce a worn Huffman code. If the utmost code dimension doesn't exceed your restrict then you may perhaps perchance perhaps very successfully be obviously done.

If you happen to glimpse at all codes having the longest code dimension then there is repeatedly an even quantity of them


(if that wasn't the case then one image may perhaps perchance perhaps presumably be made one bit shorter because its closing bit is unused → an atypical quantity must be an error to your Huffman implementation).

You repeat this direction of till reaching maxLength:

  • select two symbols x and y having that longest bit dimension
  • the canonical Huffman code of x and y might be virtually same: x ends with 0 and y ends with 1
  • let's call their shared prefix P
  • we resolve x's trailing 0, in remark that x becomes P, being one bit shorter
  • on the opposite hand, y is invalid now
  • let's select a third image z which is shorter than the fresh x (which implies at the least two bits shorter than the calm invalid y)
  • appending a 0 to z makes it one bit longer
  • appending a 1 to z creates an absolutely fresh code that might be old style for y

What did we pause ?

  1. we made 1 image shorter by exactly 1 bit (x)
  2. we made 1 image shorter by at the least 1 bit (y)
  3. we made 1 image longer by exactly 1 bit (z)

Why does it work ?

  • from a mathematical point of glimpse the sum of all Kraft values must be
  • we know that bits(x) = bits(y)
  • if bits(z) = bits(x) - 2 then sum(Kraft) = sum(Kraft')
    • because bits(x') = bits(x) - 1
    • and bits(y') = bits(y) - 1
    • and bits(z') = bits(z) + 1
    • in remark that bits(z') = bits(x') = bits(y')
  • let's ignore all symbols with the exception of x, y and z in remark that we've:


    sum(Okay) = Okay(x) + Okay(y) + Okay(z)


    = 2^-bits(x) + 2^-bits(y) + 2^-bits(z)


    = 2^-bits(x) + 2^-bits(x) + 2^-(bits(x) - 2)


    = 2 2^-bits(x) + 4 -bits(x)


    = 6 2^-bits(x)


    sum(Okay') = Okay(x') + Okay(y') + Okay(z')


    = 3 Okay(x')


    = 3 2^-bits(x')


    = 3 2^(-bits(x) - 1)


    = 6 2^-bits(x)


    = sum(Okay)
  • therefore the sum of all Kraft values stays unchanged after the transformation

Every step reduces the bit lengths of upright two symbols by generally one bit.


Sadly, processing a immense alphabet with very grand bit lengths might be somewhat dull.


Nonetheless, this allege may perhaps be very unlikely with the little JPEG alphabet (generally 162 symbols with an higher code dimension restrict of 16).

The code simplifes and becomes sooner if work change accurate into a histogram of code lengths in establish of dwelling of the code lengths themselves.


That capacity you sprint a single stream over all code lengths and count what number of symbols bear one bits, two bits, three bits, etc.

The internal loop of my characteristic limitedJpegInPlace is surprisingly shprt:



veil

limitedJpegInPlace


unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
if (newMaxLength <= 1)
return 0;
if (newMaxLength > oldMaxLength)
return 0;
if (newMaxLength == oldMaxLength)
return newMaxLength;

// iterate over all "excessive" bit lengths, starting with the longest
unsigned char i = oldMaxLength;
while (i > newMaxLength)
{
// no codes at this bit dimension ?
if (histNumBits[i] == 0)
{
i--;
continue;
}

// glimpse codes that are at the least two bits shorter
unsigned char j = i - 2;
while (j > 0 && histNumBits[j] == 0)
j--;

// replace bit dimension of 2 of the longest codes
histNumBits[i] -= 2;
// one code becomes one bit shorter
// (using the joint prefix of the celebrated two codes)
histNumBits[i - 1]++;

// the opposite code has dimension j+1 now
// but any other, not-yet-eager, code with dimension j
// is moved to bit dimension j+1 as successfully
histNumBits[j + 1] += 2;
histNumBits[j]--;
}

// return longest code dimension that is calm old style
while (i > 0 && histNumBits[i] == 0)
i--;

// JPEG Annex Okay.3 specifies an further line:
// histNumBits[i]--;
// => because JPEG desires a somewhat a pair of image to withhold a long way flung from 0xFF in its output stream

return i;
}

MiniZ's Length-Limiting Algorithm

The DEFLATE file format imposes a 16 bit restrict in its Huffman stage.


Unlike JPEG there isn't the kind of thing as a legit dimension-limiting direction of in remark that programmers came up with somewhat a pair of (but somehow very same) code.


If found out the most simply identifiable code develop into as soon as MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).

gzip and its zlib are more standard but the capacity they restrict code lengths is considerably hidden and interweaved with other functionality.


As a long way as I realize each and every MiniZ's and zlib's dimension-limiting behaves identically.

While MiniZ's regular opinion is reminiscent of JPEG's (for hundreds of enter they create same output), it is mighty sooner:


in establish of dwelling of slowly "eating" one bit away at a time from prolonged codes, it accurate away shortens all (!) prolonged codes to the fresh maxLength restrict.


Now the Kraft-McMillan inequality may perhaps perchance perhaps also very successfully be violated in remark that a pair of codes must calm be mounted:

  1. select a code x with maximum dimension (=maxLength)
  2. select any other code y that is shorter, if that you may perhaps perchance perhaps presumably be also enlighten of maxLength-1 bits (even shorter codes work, too, but are inefficient)
  3. invent y one bit longer
  4. attach x the same dimension as the "fresh" y
  5. we upright reduced the Kraft-McMillan sum by (at the least) 2^-maxLength
  6. if it calm exceeds 1 then continue with step 1

Point to: it is that you may perhaps perchance perhaps presumably be also enlighten of (and somewhat seemingly !) that the chosen code x will get assigned the same code dimension it had sooner than.


For example, if newMaxLength is 16 and you may perhaps perchance perhaps simply bear codes with 15 bits, too.


Now when you select a 16 bit code and select a 2nd 15 bit code then that 15 bit code becomes a 16 bit code and the 16 bit code stays a 15+1=16 bit code.

The Kraft-McMillan computations might be performed with integers if we multiply everything by 2^maxLength:

2^-maxLength 2^maxLength = 1, thus each and every iteration "saves" one unit and the total sum must be under 1 2^maxLength = 2^maxLength.

If a histogram of code lengths is old style in establish of dwelling of code dimension themselves (upright like the JPEG algorithm) then the internal loop is upright 50 strains, and most of it are comments:



veil

limitedMinizInPlace


unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])
{
if (newMaxLength <= 1)
return 0;
if (newMaxLength > oldMaxLength)
return 0;
if (newMaxLength == oldMaxLength)
return newMaxLength;

// my allround variable for assorted loops
unsigned int i;

// scoot all outsized code lengths to the longest allowed
for (i = newMaxLength + 1; i <= oldMaxLength; i++)
{
histNumBits[newMaxLength] += histNumBits[i];
histNumBits[i] = 0;
}

// compute Kraft sum
// (using integer math: everything is multiplied by 2^newMaxLength)
unsigned long long total = 0;
for (i = newMaxLength; i > 0; i--)
total += histNumBits[i] << (newMaxLength - i);

// iterate till Kraft sum doesn't exceed 1 anymore
unsigned prolonged prolonged one = 1ULL << newMaxLength;
while (total > one)
{
// select a code with maximum dimension, this might be moved
histNumBits[newMaxLength]--;

// derive a 2nd code with shorter dimension
for (i = newMaxLength - 1; i > 0; i--)
if (histNumBits[i] > 0)
{
histNumBits[i]--;
// prolong that shorter code by one bit
// and repair the same code dimension to the chosen one
histNumBits[i + 1] += 2;

damage;
}

// transferring these codes reduced the Kraft sum
total--;
}

return newMaxLength;
}

BZip2's Length-Limiting Algorithm

bzip2 is a extraordinary compression theory. One of its quite a bit of stages is a Huffman compression stage.


As a replace of playing around with the Kraft-McMillan inequality it entirely depends on a hasty Huffman encoder and trial'n'error:

  • compute Huffman codes to your alphabet
  • if maximum code dimension doesn't exceed maxLength then you may perhaps perchance perhaps very successfully be done
  • otherwise lower each and every image's count (but withhold a long way flung from turning into zero)
  • and return to step 1

There are many potentialities the style to put in force step 3. bzip2 divides by two and clears the lower 8 bits of every and every counter (and provides 1 to withhold a long way flung from zeros).


While being very hasty, it would also simply create sub-optimal Huffman codes. Now and again dividing by 3 presents higher results but that entirely depends to your enter files.

The core loop is:



veil

bzip2


// sprint Moffat algorithm ...
unsigned char end result = moffatSortedInPlace(numNonZero, sorted);
// ... till a lawful maximum code dimension is found out
while (end result > maxLength)
{
// roughly divide each and every weight by two while warding off zero
for (i = 0; i < numNonZero; i++)
{
unsigned int weight = mapping[i].key;

// bzip2 "clears" the lowest 8 bits of the histogram
// to reach the length limit in less iterations
// but sacrifices lots of efficiency
// if you set EXTRA_SHIFT to 0 then the code may need more iterations
// but finds much better code lengths
//#define EXTRA_SHIFT 8
#define EXTRA_SHIFT 0

// sometimes dividing the weight by a bigger integer (e.g. 3)
// may lead to more efficient prefix codes
#define DIVIDE_BY 2

// the shifts do more harm than good in my opinion
weight >>= EXTRA_SHIFT;

// including 1 avoids zero
weight = 1 + (weight / DIVIDE_BY);
weight <<= EXTRA_SHIFT;

// sorted develop into as soon as overwritten with code lengths
mapping[i].key = sorted[i] = weight;
}

// again: sprint Moffat algorithm
end result = moffatSortedInPlace(numNonZero, sorted);
}

The compression efficiency is in virtually all cases worse that Kit-Merge/JPEG/MiniZ. Moreover, it is slower than JPEG/MiniZ so there isn't the kind of thing as a real motive selecting bzip2's capacity.

Kraft-McMillan Approximations

The aim of every and every presented dimension-limiting algorithm is producing a code that is successfully matched to Huffman decoders.


Rather then Kit-Merge, all of them must generate a straightforward Huffman code and then alter overlong symbols.

I already talked about the Kraft-McMillan inequality which states that every and every code where sum(2^-codeLength[0...numCodes]) is a prefix-free code.


That capacity each and every such code might be decoded by Huffman decoders, too.

I came all over a pair of postings, mainly by Charles Bloom, discussing how fully focussing on the Kraft-McMillan inequality might be each and every hasty and shut to optimal.


Nonetheless, I made a pair of indispensable changes to walk up issues:

  1. compute theoretical quantity of bits for every and every image: entropy = -log2(count / total)
  • round up or accurate down to closest integer, right here's our initial wager for the code dimension
    • but under no circumstances round to zero
    • if above our dimension restrict, then establish of dwelling it to that restrict
  • compute Kraft-McMillan sum of all symbols' code lengths
  • Now we've a rough estimation. If the Kraft-McMillan sum doesn't exceed 1 then right here's truly a lawful prefix-free code (but presumably it is sub-optimal).


    If that sum is above 1 then we must alter (=amplify) a pair of code lengths. The principle opinion is editing these code lengths which "damage the least".


    When we rounded to the closest integer, many symbols rounded down: their proper code dimension is smaller than their theoretical bit dimension.

    Clutch you may perhaps perchance perhaps even bear a image which represents 9% of all symbols. Then its entropy is -log2(0.09) ≈ 3.474 → rounded to 3. There is a "own" of 3.474 - 3 = 0.474.


    "Fabricate" capacity that we old style 0.474 much less bits per image occurrence than the theoretical minimal, we saved some bits.


    If we replace that image's code dimension to 4 (assuming it is allowed by the dimensions restrict), then there might be a "loss" of 3.474 - 4 = -0.526.


    "Loss" capacity that we old style 0.526 much less bits per image occurrence than the theoretical minimal, we wasted some bits.


    The Kraft-McMillan sum might be lower though, optimistically now under 1, at the worth of rather worse compression.

    Shall we embrace there is any other image representing 10% of all symbols, thus being a diminutive bit more frequent than the previous image.


    It is entropy is -log2(0.1) ≈ 3.322, rounded to 3 as successfully. Attributable to this reality it is "own" is 3.322 - 3 = 0.322 and doubtless "loss" after including one bit is 3.322 - 4 = -0.678.


    Both symbols bear an designate on the Kraft-McMillan sum within the same capacity, but the latter image will motive even worse compression.

    You bear perchance observed that including one bit to image which had a own of x results in a loss of 1-x.


    So the most most considerable opinion is: repeatedly glimpse symbols with maximum own to motive minimal loss after including rather.

    I utilized two somewhat a pair of suggestions to change the code lengths, known as A and B.


    And inch, I obtained uninterested within the prolonged title "Kraft-McMillan inequality" and upright known because it "Kraft" (which is the good German word for "energy"/"energy"/"force").

    Kraft - Formula A

    This algorithm works with quite a bit of passes. A threshold, within the origin a little payment, decreases after each and every stream.


    Walking thru all symbols, we glimpse symbols where the adaptation between theoretical bit dimension and proper bit dimension is above that threshold ("excessive own").


    One bit might be added and the Kraft sum adjusted. As soon as the Kraft sum isn't elevated than 1 anymore we've a legit prefix-free code.

    It is that you may perhaps perchance perhaps presumably be also enlighten of to "overshoot" by extending too many previously very short codes - which bear a excessive affect on the Kraft sum.


    If we are successfully under a Kraft sum of 1 then a single stream looks for codes that might be made one bit shorter without violating the Kraft sum.


    In many cases this closing stream isn't wanted and might be skipped.

    The algorithm works quite successfully. Nonetheless, the quantity of passes massively depends on the enter files.


    If the initial threshold is fair too excessive and/or shrinks finest a diminutive after each and every stream then an ideal quantity of passes might be required.


    If the initial threshold is fair too low and/or shrinks loads after each and every stream then finest few passes might be required.


    While the latter sounds considerably higher, this can scoot away out many optimal code lengths and create sub-optimal results.


    I did not derive a supreme one-suits-all configuration. The default INITIAL_THRESHOLD = 28 / 64.0 and STEP_THRESHOLD = 1 / 64.0 worked successfully in my experiments.

    Kraft - Formula B

    Formula A makes a pair of guesses in the case of the threshold. Moreover, it again and again scans symbols which have to not be modified.


    My approach B implements a max-queue. Its first ingredient is repeatedly the image with the absolute top own.


    Attributable to this reality we don't prefer scan all symbols but repeatedly select from the head of that max-queue.


    After a image develop into as soon as adjusted, there is a must re-insert it into the max-queue.

    One of many most atmosphere friendly files buildings for such score admission to sample is a heap. Most regular libraries advance with a heap implementation (reminiscent of C++).


    Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.

    Corresponding to approach A, we would also "overshoot" in remark that a hasty repair-up exists as successfully.

    There may perhaps be not any vital incompatibility between approach A and B referring to the generated code lengths.


    Nonetheless, approach B is mighty sooner (generally by a element of 4) and even outperformes an atmosphere friendly Huffman implementation.


    It is attention-grabbing to pronounce that the dimensions restrict has a truly indispensable impact on the execution time: lower limits are sooner.

    License

    This code is licensed under the zlib License:

    This utility is geared up 'as-is', with none remark or implied
    warranty. In no tournament will the authors be held liable for any damages
    increasing from the utilization of this utility.

    Permission is granted to somebody to make employ of this utility for any purpose,
    including industrial applications, and to alter it and redistribute it
    freely, field to the following restrictions:

    1. The starting establish of this utility must never be misrepresented; you want not
    claim that you wrote the customary utility. If you happen to make employ of this utility
    in a product, an acknowledgment within the product documentation may perhaps perchance perhaps presumably be
    liked but isn't required.
    2. Altered source variations must be it appears that marked as such, and must never be
    misrepresented as being the customary utility.
    3. This inspect may perhaps perchance perhaps also simply not be removed or altered from any source distribution.zlib License

    Git customers: scroll down to the repository link

    Most up-to-date liberate: February 5, 2021, dimension: 1668 bytes, 30 strains

    CRC32: 9f50f131

    MD5: 9c4358817e4f951186dc6c87866ab727

    SHA1: 5533b6f2fa58c109c590ae4f33665ac7d3e4b896

    SHA256: beb1da8cdce260cb540ab871c5f98d46b596e3956d61324fd1567739492384f8

    Most up-to-date liberate: June 29, 2021, dimension: 11.0 kBytes, 338 strains

    CRC32: b70bd1e5

    MD5: e826eda6c202bfa2fa22daf9e3a5a246

    SHA1: fc47f5ef18d5c82996032b3d3d9c83a85df8f940

    SHA256: 2dec5ba9d228eeb44bab2f90312518f546475a65982566857bc8ffe402c42a34

    Most up-to-date liberate: February 5, 2021, dimension: 2208 bytes, 41 strains

    CRC32: 979b7a5c

    MD5: 264867286a2a10419e77ad7e2da9ee69

    SHA1: 2cebbac1ae3c7f53ee7562ed73b9e68ca4987dff

    SHA256: 04e45e82ad87c9bb938d7e68b9a39346dbe6d1c8ff9e5859894be43ec62ba406

    Most up-to-date liberate: February 10, 2021, dimension: 4436 bytes, 175 strains

    CRC32: a5935e36

    MD5: db5f8486daebbbe7179c2ed964bd7a80

    SHA1: b688a68b47c8ef654e3271a78152a5382c480820

    SHA256: 1ec47ed8df361a5cb8ef847b94dc1be85fccfaaab2c08d7d56f4dd551d8b31eb

    Most up-to-date liberate: February 5, 2021, dimension: 4835 bytes, 80 strains

    CRC32: 161095e9

    MD5: 715a76cfe7bb7d74e8c1e5f2fbcfb4d9

    SHA1: e0f8bd3a6b3afcd6a793fa8daab8b7b34db92cf4

    SHA256: e75a18a20fe66cee275a0d0ebe52cca3bf9ad14c92994c3a11d5a68651d98819

    Most up-to-date liberate: February 10, 2021, dimension: 12.2 kBytes, 356 strains

    CRC32: c503747c

    MD5: e878141ab902742613f9863e97118ec0

    SHA1: e6a180b59caf39df790cc047876392c65c114232

    SHA256: 0e85a8d136f07214b4203a4036a75a3f817e64f1edebf7028b3df55217924ed6

    Most up-to-date liberate: February 5, 2021, dimension: 1476 bytes, 29 strains

    CRC32: c958c820

    MD5: 5735c2438ffbd21540c640e48a8217fd

    SHA1: e157bb74cc746609ae083ceb497bf4327f31f016

    SHA256: 39a411f02183cc34a8559f973724517a6d81c68c5259478f0a05fdb8fb66658a

    Most up-to-date liberate: February 10, 2021, dimension: 4050 bytes, 131 strains

    CRC32: 821a4f0f

    MD5: af0a84f35b32310ec33ad6da058ff3e4

    SHA1: a1eff36323c9917f62f0cfd1ee000388c44a4c1e

    SHA256: a8bfb557d645a1419df9fe071682d9be88af71c82905bddfb0f3d10e50964a78

    Most up-to-date liberate: February 5, 2021, dimension: 745 bytes, 17 strains

    CRC32: cb6d4745

    MD5: 9f5615706645e6d591769cd9385c90bd

    SHA1: e8e526e2e32b16197ec77c593f476230221217a6

    SHA256: 8ee50def030ba1dd2383febc970496c1f7b9cfe824774d133b5f75cc0741b641

    Most up-to-date liberate: February 5, 2021, dimension: 6.9 kBytes, 224 strains

    CRC32: fddbc6e9

    MD5: a210caef2d00f1be7c9da18b193df501

    SHA1: 648b90061f069d74beb41decdcd7ed28c4eaffef

    SHA256: 545e870024999e917bd52347419872458178f61645853e5360bf04f35e2aa63c

    Most up-to-date liberate: February 5, 2021, dimension: 753 bytes, 17 strains

    CRC32: 7a4c7aa3

    MD5: f920b31d7c745895287ae07e357b0737

    SHA1: 7bf007989f4ee0ec9e6af857038fa52bf7a5b59f

    SHA256: c12082e6be08b9028d80ad2f148ecfb780ea4daf9ee01b9569e7f30553dba489

    Most up-to-date liberate: February 5, 2021, dimension: 9.7 kBytes, 342 strains

    CRC32: 0a9bb4f3

    MD5: 1d5fcb52464eb0ae0922ec43dddf862f

    SHA1: 99e13c8494a78e5c4d602abb124da76e2064d049

    SHA256: 3a4559d076c85703d1936b4ee09a9d1b2fbc71ab73ddf20ee9805ab29420fc56

    Bring together up-to-date: git clone https://produce.stephan-brumme.com/dimension-dinky-prefix-codes/git

    If you happen to advance all over any bugs/considerations or bear suggestions for boosting future variations, please write me an e-mail:
    [email protected]

    Changelog

    • version 1
      • June 30, 2021
      • initial liberate

    Be taught Extra

    Leave a Reply

    Your email address will not be published. Required fields are marked *