Length-Dinky Prefix Codes

Overview

Utilization

Benchmark

Package-Merge

Kraft-McMillan Inequality

Overview

Utilization

Benchmark

Package-Merge

Kraft-McMillan Inequality

JPEG's Length-Limiting Intention

MiniZ's Length-Limiting Algorithm

BZip2's Length-Limiting Algorithm

Kraft-McMillan Approximations

License

Offer Code

Changelog

Kraft - Formula A

Kraft - Formula B

Leave a Reply Cancel reply

posted June 30, 2021 by Stephan Brumme

Essentially the most regular methodology to generate prefix-free codes is the Huffman algorithm.

It is allotment of many standard compression algorithms reminiscent of JPEG,
DEFLATE (ZIP /
GZIP / PNG / etc.).

Typically it serves as an “afterburner” to compress pre-processed symbols.

The principle advantage – variable-dimension codes – is known as a excessive downside, too: if the utmost code dimension exceeds a definite threshold,
e.g. the CPU’s integer dimension, (16 / 32 / 64 bit), then the compression/decompression code becomes advanced and slows down.

All of the aforementioned file formats place a strict higher restrict on the utmost dimension of Huffman codes.

Mainly on account of their age and low quantity of distinct symbols, all of them agreed on 16 bits per code.

So-known as dimension-dinky prefix codes will not be Huffman codes anymore because they place not appear to be optimal (if the dimensions restriction develop into as soon as enforced).

The loss in compression efficiency turns out to be in most cases negligible and thus is widely authorized on account of the quite a bit of own in decompression performance.

There may perhaps be somewhat a pair of algorithms to present such dimension-dinky prefix codes. They vary referring to walk, efficiency and complexity.

I wrote a C library imposing the commonest methods. Many of the code derives from open source programs and develop into as soon as invented by others.

My vital contribution develop into as soon as to present a shared interface in remark that these algorithms are simply interchangeable:

veil

interface

unsigned char algorithmName(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]);

There are three enter parameters:

maxLength is the upper restrict of the quantity of bits for every and every prefix code
numCodes is the quantity of distinct symbols (including unused symbols) → dimension of your alphabet
histogram is an array of numCodes counters how in most cases each and every image is portray

and one output parameter:

codeLengths will have the bit dimension of every and every image

The return payment is:

the longest bit dimension or zero if the algorithm failed

If you happen to like to wish to encode the sequence 0 3 1 1 2 5 2 2 then numCodes = 6 and histogram[] = { 1,2,3,1,0,1 };

Nonetheless, this shared interface comes with a diminutive bit of overhead: each and every so again and again doubling code dimension and/or execution time.

Attributable to this reality most files bear a 2nd public characteristic which is more specialized for its algorithm but may perhaps perchance perhaps also simply bear a pair of further restrictions.

An everyday case is that the histogram must be sorted and unused symbols removed.

For the time being utilized are:

Every algorithm runs self sustaining of every and every other. There don’t appear to be any exterior dependencies, too.

Some algorithms require pre-computed Huffman codes – for these cases I added a rather modified version of
Andrew Moffat’s atmosphere friendly in-establish of dwelling Huffman code.

If you happen to like to wish to confirm out out an algorithm then all you want present is:

veil

Utilization

#consist of "packagemerge.h" #consist of int vital() { // let's own you may perhaps perchance perhaps even bear 7 symbols unsigned int numCodes = 7; // first one happens 270 cases, 2nd 20x, third 10x, fourth under no circumstances, fifth as soon as, sixth 6x and seventh as soon as unsigned int histogram[7] = { 270,20,10,0,1,6,1 }; // at most 4 bits unsigned char maxLength = 4; // output: bits per code unsigned char codeLengths[7]; // compute dimension-dinky codes unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths); // pronounce end result for (unsigned int i = 0; i < numCodes; i++) printf("code %d occurs %d times => %d bitsn", i, histogram[i], codeLengths[i]); printf("max. code dimension is %dn", longest); return 0; }

When in comparison with the customary Huffman algorithm, a pair of code grew to change into shorter and a pair of grew to change into longer.

The amount of bits / codeLengths are:

Symbol	0	1	2	3	4	5	6
Depend	270	20	10	0	1	6	1	sum = 308
Huffman	1	2	3	0	5	4	5	sum = 374 bits
`maxLength = 4`	1	2	4	0	4	4	4	sum = 382 bits
`maxLength = 3`	2	2	3	0	3	3	3	sum = 634 bits

As you may perhaps perchance perhaps look, limiting the utmost code lengths hurts the compression ratio.

There are quite a bit of methods to rework code lengths to proper prefix-free codes.
I derive canonical Huffman codes to be the most upright: a straightforward and hasty algorithm.

Utilizing the instance files we score these canonical codes:

Symbol	0	1	2	3	4	5	6
Depend	270	20	10	0	1	6	1
Huffman	`0`	`10`	`110`	–	`11110`	`1110`	`11111`
`maxLength = 4`	`0`	`10`	`1100`	–	`1101`	`1110`	`1111`
`maxLength = 3`	`00`	`01`	`100`	–	`101`	`110`	`111`

An everyday program exhibiting the style to compute canonical codes (click on to open):

portray

canonical.c

#consist of "packagemerge.h" #consist of // produce canonical codes, for simplicity longest code must never exceed quantity of bits of an unsigned int (32 or 64) void canonical(unsigned int numCodes, unsigned char codeLengths[], unsigned int codes[]) { // count how in most cases each and every code dimension is portray unsigned int numLengths[256] = { 0 }; unsigned int i; for (i = 0; i < numCodes; i++) numLengths[codeLengths[i]]++; // precompute the most most considerable code of every and every code dimension unsigned int subsequent[256]; // there isn't any code with dimension 0 and the most most considerable 1-bit code with be a zero subsequent[1] = 0; for (i = 2; i < 256; i++) subsequent[i] = (subsequent[i - 1] + numLengths[i - 1]) << 1; // now attach all codes for (i = 0; i < numCodes; i++) codes[i] = subsequent[codeLengths[i]]++; } int vital() { // let's own you may perhaps perchance perhaps even bear 7 symbols unsigned int numCodes = 7; // first one happens 270 cases, 2nd 20x, third 10x, fourth no at all, fifth as soon as, sixth 6x and seventh as soon as unsigned int histogram[7] = { 270,20,10,0,1,6,1 }; // at most 4 bits unsigned char maxLength = 4; // output: bits per code unsigned char codeLengths[7]; // compute dimension-dinky codes unsigned char longest = packageMerge(maxLength, numCodes, histogram, codeLengths); // produce canonical codes unsigned int codes[7]; canonical(numCodes, codeLengths, codes); // pronounce end result unsigned int i; for (i = 0; i < numCodes; i++) { printf("code %d occurs %d timest=> %d bits", i, histogram[i], codeLengths[i]); // binary represenation of canonical code printf("t=> "); if (codeLengths[i] > 0) { int shift; for (shift = codeLengths[i] - 1; shift >= 0; shift--) printf(codes[i] & (1 << shift) ? "1" : "0"); } printf("n"); } printf("max. code length is %dn", longest); return 0; }

Here are a few results from the first 64k bytes of enwik, measured on a Core i7 / GCC x64:

time ./benchmark 1 12 100000

where 1 is the algorithm’s ID and was between 0 and 6
limiting the code length to at most 12 bits

note: the unadjusted Huffman codes have up to 16 bits

each algorithm ran 100000 times

For comparison: the uncompressed data has 64k bytes = 524,288 bits:

algorithm	ID	total size	compression ratio	execution time
Huffman	`0`	326,892 bits	62.35%	0.54 s
Package-Merge	`1`	327,721 bits	62.51%	1.17 s
MiniZ	`2`	328,456 bits	62.65%	0.58 s
JPEG	`3`	328,456 bits	62.65%	0.61 s
BZip2	`4`	328,887 bits	62.73%	0.88 s
Kraft	`5`	327,895 bits	62.54%	1.72 s
modified Kraft	`6`	327,942 bits	62.55%	0.41 s

Remember: each algorithm was repeated 100,000 times, so a single iteration finished in about 4 to 17 microseconds.

Changing the code length limit significantly influences both execution time and compression ratio.

I picked the two “best” algorithms, used the same data set and repeated each procedure 100,000 times:

length limit	Package-Merge	Kraft Strategy B
8 bits	70.47%, 0.96 s	70.76%, 0.24 s
9 bits	65.30%, 1.02 s	65.31%, 0.24 s
10 bits	63.49%, 1.07 s	63.79%, 0.31 s
11 bits	62.80%, 1.14 s	62.84%, 0.37 s
12 bits	62.51%, 1.17 s	62.55%, 0.40 s
13 bits	62.40%, 1.22 s	62.43%, 0.34 s
14 bits	62.36%, 1.25 s	62.42%, 0.40 s
15 bits	62.35%, 1.29 s	62.42%, 0.66 s
16 bits	62.35%, 1.35 s	62.42%, 0.70 s

For comparison: Moffat’s Huffman algorithm needs 0.55 seconds and its longest prefix code has 16 bits.

Package-Merge is an algorithm that guarantees to produce optimal length-limited prefix codes.

Its name is based on the conception of putting numbers in packages and merging these.

Unfortunately, the original paper by Lawrence L. Larmore and Daniel S. Hirschberg isn’t very accessible.

Moreover, further information is pretty scarce on the internet: the Wikipedia entry doesn’t help a lot.

The best introduction – in my opinion – is Sebastian Gesemann’s Bachelor’s Thesis but it’s available in German only.

The diagrams on page 24 and 26 walk through an example and page 29 gives pseudo code.

There are two prerequisites:

remove all unused symbols before running Package-Merge
and symbols must must be sorted in ascending order of their frequency

Let’s process the sample data shown above:

Symbol	0	1	2	3	4	5	6
Count	270	20	10	0	1	6	1

Obviously, the prerequisites aren’t fulfilled yet: unused symbol 3 needs to be removed and the remaining symbols must be in ascending order of their count:

(the order of symbols with the same count doesn’t matter)

Symbol	4	6	5	2	1	0
Count	1	1	6	10	20	270

Since there were 6 symbols (plus an unused one), we need at least 3 bits to encode them.

Naturally you can’t have an active symbol with zero bits, the minimum code length is 1 bit.

Package-Merge adds one bit in each iteration so there is a minimum of two iterations (2+1=3) but for whatever reason I decided to have this example with maxLength=4.

Now we add these counts pairwise, we build packages. If there is an odd number of symbols then the last (which is the largest) is discarded:

Symbol	4	6	5	2	1	0
Count	1	1	6	10	20	270
Pairwise Sum	2		16		290

These packages have to be merged with the original sequence in ascending order.

If a package is identical to a count of the original sequence then that item from the original sequence comes first.

In the next table packages are shown in a bold font:

Count	1	1	2	6	10	16	20	270	290

This was one iteration of phase 1. Each iteration represents one bit.

Therefore we need to repeat that procedure two more times to reach our goal maxLength=4.

However, there is one little pitfall: pairs are made using the sequence from the previous step but you always merge them with the original sequence.

That last sentence is important and was probably my most critical insight when I learned about the Package-Merge algorithm.

Okay, let’s create package for the second bit:

Count	1	1	2	6	10	16	20	270	290
Pairwise Sum	2		8		26		290

As you can see, there is an odd number of items and thus the last item (290 in the upper row) must be discarded.

The four values in the lower row are merged with the original sequence:

Count	1	1	2	6	8	10	20	26	270	290

One more time “packaging”:

Count	1	1	2	6	8	10	20	26	270	290
Pairwise Sum	2		8		18		46		560

And merging:

Count	1	1	2	6	8	10	18	20	46	270	560

Phase 1 has completed. Let’s gather all intermediate results from each iteration in one table:

Original	1	1	6	10	20	270
Iteration 1	1	1	2	6	10	16	20	270	290
Iteration 2	1	1	2	6	8	10	20	26	270	290
Iteration 3	1	1	2	6	8	10	18	20	46	270	560

Phase 2 processes the output of each iteration, going from the bottom row to the top row.

The actual values in the table don’t matter – it’s only relevant whether a value is a package (bold) or not (plain).

Code lengths for all symbols are initially set to zero.

If a value isn’t bold then the code length of a symbol has to be incremented.

The first iteration processes the first 2*numCodes-2=10 values of the bottom row, that means 1,1,...,270 but not 560.

Two counters symbol and numMerged are set to zero.

For each table entry a decision has to be made:

if that value is a package (bold) then increment numMerged
otherwise increment codeLengths[symbol] and then symbol

The next iteration has to look at the first 2*numMerged table entries.

Let’s go step-by-step through the example – as mentioned before we start with the bottom row.

The code length of all symbols was set to zero.

What happened ? We looked at ten values of iteration 3 of the first phase:

a 1 which wasn’t bold → codeLength[0] = 1 and then symbol = 1
another 1 which wasn’t bold → codeLength[1] = 1 and then symbol = 2
a 2 which is bold → numMerged = 1
a 6 which wasn’t bold → codeLength[2] = 1 and then symbol = 3
a 8 which is bold → numMerged = 2
a 10 which wasn’t bold → codeLength[3] = 1 and then symbol = 4
a 18 which is bold → numMerged = 3
a 20 which wasn’t bold → codeLength[4] = 1 and then symbol = 5
a 46 which is bold → numMerged = 4
a 270 which wasn’t bold → codeLength[5] = 1 and then symbol = 6

Somehow that wasn’t very exciting … all code lengths changed from zero to one.

Because the final value of numMerged is 4, we look at the first 4*2=8 values of the second-to-last row.

Its content was (just a copy, I stroke through irrelevant entries):

Iteration 2	1	1	2	6	8	10	20	26	~~270~~	~~290~~

This time we cope only with three bold entries and five non-bold.

That means numMerged = 3 and the code length of all-but-the-last symbol becomes 2.

That last symbol, which had the highest count, suddenly gets assigned a lower code length than all the other symbols.

Again, a copy of phase 1’s iteration 1 (where only 2*numMerged = 6 will be handled):

Iteration 1	1	1	2	6	10	16	20	~~270~~	~~290~~

There are two bold entries (merged packages) and four non-bold.

You probably already understand that the first four symbol’s code length needs to be incremented and

we will look at the first 2*numMerged = 4 numbers of the final row (the top row).

Obviously there can’t be any bold values.

The first four symbol’s code length is incremented again. Let’s take a look at the code lengths of each step:

(array index)	0	1	2	3	4	5
Symbol	4	6	5	2	1	0
Count	1	1	6	10	20	270
Code Length (initially)	0	0	0	0	0	0
Code Length (after processing bottom row / iteration 3)	1	1	1	1	1	1
Code Length (after processing iteration 2)	2	2	2	2	2	1
Code Length (after processing iteration 1)	3	3	3	3	2	1
Code Length (after processing top row)	4	4	4	4	2	1

That’s it ! The bottom code contains the final code lengths.

No symbol’s number of bits is higher than maxLength=4 and the Kraft-McMillan inequality holds true as well: 2^-4 + 2^-4 + 2^-4 + 2^-4 + 2^-2 + 2^-1 = 1.

The algorithm is guaranteed to create optimal code lengths for a prefix-free code. No other combination of code lengths will be better.

I left out any theoretical stuff, such as proofs etc., and skipped practical optimizations to keep things short and sweet.

Take a look at my code (click below) or read the scientific papers to learn more details.

show

packagemerge.c

// ////////////////////////////////////////////////////////// // packagemerge.c // written by Stephan Brumme, 2021 // see https://create.stephan-brumme.com/length-limited-prefix-codes/ // #include "packagemerge.h" #include // malloc/free/qsort // ----- package-merge algorithm ----- // to me the best explanation is Sebastian Gesemann's Bachelor Thesis (in German only / University of Paderborn, 2004) // data types (switching to unsigned int is faster but fails if sum(histogram) > 2^31 or maxLength > 31) typedef unsigned prolonged prolonged BitMask; typedef unsigned prolonged prolonged HistItem; /// compute dinky prefix code lengths according to Larmore/Hirschberg's equipment-merge algorithm /- histogram must be in ascending pronounce and no entry must be zero - the characteristic rejects maxLength > 63 but I don't look any gleaming reasons you'd desire a elevated restrict ... @param maxLength maximum code dimension, e.g. 15 for DEFLATE or JPEG @param numCodes quantity of codes, equals the array dimension of histogram and codeLength @param A [in] how in most cases each and every code/image develop into as soon as found out [out] computed code lengths @end result proper maximum code dimension, 0 if error */ unsigned char packageMergeSortedInPlace(unsigned char maxLength, unsigned int numCodes, unsigned int A[]) { // skip zeros while (numCodes > 0 && A[0] == 0) { numCodes--; A++; } // at the least one code desires to be in employ if (numCodes == 0 || maxLength == 0) return 0; // one or two codes are repeatedly encoded with a single bit if (numCodes <= 2) { A[0] = 1; if (numCodes == 2) A[1] = 1; return 1; } // A[] is an input parameter (stores the histogram) as well as // an output parameter (stores the code lengths) const unsigned inthistogram = A; unsigned intcodeLengths = A; // my allround variable for various loops unsigned int i; // check maximum bit length if (maxLength > 8*sizeof(BitMask) - 1) // 8*8-1 = 63 return 0; // at the least log2(numCodes) bits required for every and every legit prefix code unsigned prolonged prolonged encodingLimit = 1ULL << maxLength; if (encodingLimit < numCodes) return 0; // need two buffers to direction of iterations and an array of bitmasks unsigned int maxBuffer = 2 numCodes; // allocate memory HistItempresent = (HistItem*) malloc(sizeof(HistItem) maxBuffer); HistItemprevious = (HistItem*) malloc(sizeof(HistItem) maxBuffer); BitMaskisMerged = (BitMask*) malloc(sizeof(BitMask) maxBuffer); // initial payment of "previous" is a straightforward replica of the sorted histogram for (i = 0; i < numCodes; i++) previous[i] = histogram[i]; unsigned int numPrevious = numCodes; // no must initialize "present", it is entirely rebuild each and every iteration // withhold word which applications are merged (compact bitmasks): // if equipment p develop into as soon as merged in iteration i then (isMerged[p] & (1 << i)) != 0 for (i = 0; i < maxBuffer; i++) isMerged[i] = 0; // there are no merges before the first iteration // the last 2 packages are irrelevant unsigned int numRelevant = 2 numCodes - 2; // ... and preparation is finished // ////////////////////////////////////////////////////////////////////// // iterate through potential bit lengths while packaging and merging pairs // (step 1 of the algorithm) // - the histogram is sorted (prerequisite of the function) // - the output must be sorted, too // - thus we have to copy the histogram and every and then insert a new package // - the code keeps track of the next package and compares it to // the next item to be copied from the history // - the smaller value is chosen (if equal, the histogram is chosen) // - a bitmask named isMerged is used to keep track which items were packages // - repeat until the whole histogram was copied and all packages inserted // bitmask for isMerged BitMask mask = 1; unsigned char bits; for (bits = maxLength - 1; bits > 0; bits--) { // ignore closing ingredient if numPrevious is atypical (can't be paired) numPrevious &= ~1; // bit-twiddling trick to clear the bottom bit, same as numPrevious -= numPrevious % 2 // first merged equipment present[0] = histogram[0]; // a sum can't be smaller than its system present[1] = histogram[1]; // therefore it is very unlikely to search out a equipment at index 0 or 1 HistItem sum = present[0] + present[1]; // same as previous[0] + previous[1] // replica histogram and insert merged sums whenever that you may perhaps perchance perhaps presumably be also enlighten of unsigned int numCurrent = 2; // present[0] and present[1] had been already establish of dwelling unsigned int numHist = numCurrent; // we took them from the histogram unsigned int numMerged = 0; // but thus a long way no equipment inserted (on the opposite hand, it is precomputed in "sum") for (;;) // end/damage is within the loop { // the subsequent equipment isn't higher than the subsequent histogram merchandise ? if (numHist < numCodes && histogram[numHist] <= sum) { // copy histogram item current[numCurrent++] = histogram[numHist++]; continue; } // okay, we have a package being smaller than next histogram item // mark output value as being "merged", i.e. a package isMerged[numCurrent] |= mask; // store package current [numCurrent] = sum; numCurrent++; // already finished last package ? numMerged++; if (numMerged 2 >= numPrevious) damage; // precompute subsequent sum sum = previous[numMerged 2] + previous[numMerged 2 + 1]; } // make sure each and every code from the histogram is included // (relevant if histogram may perhaps be very skewed with a pair of outliers) while (numHist < numCodes) present[numCurrent++] = histogram[numHist++]; // put together subsequent conceal conceal <<= 1; // performance tweak: abort as soon as "previous" and "current" are identical if (numPrevious >= numRelevant) // ... at the least their relevant system { // fundamentally a bool: FALSE == 0, TRUE == 1 char keepGoing = 0; // review each and every arrays: if they're same then end for (i = numRelevant - 1; i > 0; i--) // collisions are presumably at the pause if (previous[i] != present[i]) { keepGoing++; damage; } // early exit ? if (keepGoing == 0) damage; } // swap pointers "previous" and "present" HistItemtmp = previous; previous = present; present = tmp; // no must swap their sizes because finest numCurrent wanted in subsequent iteration numPrevious = numCurrent; } // shifted one bit too a long way conceal >>= 1; // withhold finest isMerged free(previous); free(present); // ////////////////////////////////////////////////////////////////////// // tracking all merges will create the code lengths // (step 2 of the algorithm) // - analyze each and every bitlength's conceal in isMerged: // a "pure" image => amplify bitlength of that image // a merged code => upright amplify counter // - end if no more merged codes found out // - if m merged codes had been found out then finest word // the most most considerable 2*m system within the subsequent iteration // (because finest they fashioned these merged codes) // reset code lengths for (i = 0; i < numCodes; i++) codeLengths[i] = 0; // initiate with examining the most most considerable 2n-2 values unsigned int numAnalyze = numRelevant; while (conceal != 0) // stops if nothing but symbols are found out in an iteration { // quantity of merged applications seen thus a long way unsigned int numMerged = 0; // the most most considerable two system must be symbols, they may be able to't be applications codeLengths[0]++; codeLengths[1]++; unsigned int image = 2; // glimpse at applications for (i = image; i < numAnalyze; i++) { // check bitmask: not merged if bit is 0 if ((isMerged[i] & mask) == 0) { // we have a single non-merged symbol, which needs to be one bit longer codeLengths[symbol]++; symbol++; } else { // we have a merged package, so that its parts need to be checked next iteration numMerged++; } } // look only at those values responsible for merged packages numAnalyze = 2 numMerged; // note that the mask was originally slowly shifted left by the merging loop mask >>= 1; } // closing iteration can't bear any merges for (i = 0; i < numAnalyze; i++) codeLengths[i]++; // it's a free world ... free(isMerged); // first symbol has the longest code because it's the least frequent in the sorted histogram return codeLengths[0]; } // the following code is almost identical to function moffat() in moffat.c // helper struct for qsort() struct KeyValue { unsigned int key; unsigned int value; }; // helper function for qsort() static int compareKeyValue(const voida, const voidb) { struct KeyValueaa = (struct KeyValue*) a; struct KeyValuebb = (struct KeyValue*) b; // negative if a < b, zero if a == b, positive if a > b if (aa->key < bb->key) return -1; if (aa->key > bb->key) return +1; return 0; } /// same as sooner than but histogram might be in any pronounce and may perhaps perchance perhaps simply have zeros, the output is kept in a dedicated parameter /- the characteristic rejects maxLength > 63 but I don't look any gleaming reasons you'd desire a elevated restrict ... @param maxLength maximum code dimension, e.g. 15 for DEFLATE or JPEG @param numCodes quantity of codes, equals the array dimension of histogram and codeLength @param histogram how in most cases each and every code/image develop into as soon as found out @param codeLength [out] computed code lengths @end result proper maximum code dimension, 0 if error */ unsigned char packageMerge(unsigned char maxLength, unsigned int numCodes, const unsigned int histogram[], unsigned char codeLengths[]) { // my allround variable for assorted loops unsigned int i; // reset code lengths for (i = 0; i < numCodes; i++) codeLengths[i] = 0; // count non-zero histogram values unsigned int numNonZero = 0; for (i = 0; i < numCodes; i++) if (histogram[i] != 0) numNonZero++; // reject an empty alphabet because malloc(0) is undefined if (numNonZero == 0) return 0; // allocate a buffer for sorting the histogram struct KeyValuemapping = (struct KeyValue*) malloc(sizeof(struct KeyValue) numNonZero); // replica histogram to that buffer unsigned int storeAt = 0; for (i = 0; i < numCodes; i++) { // skip zeros if (histogram[i] == 0) continue; mapping[storeAt].key = histogram[i]; mapping[storeAt].payment = i; storeAt++; } // now storeAt == numNonZero // invoke C regular library's qsort qsort(mapping, numNonZero, sizeof(struct KeyValue), compareKeyValue); // extract ascendingly ordered histogram unsigned intsorted = (unsigned int*) malloc(sizeof(unsigned int) numNonZero); for (i = 0; i < numNonZero; i++) sorted[i] = mapping[i].key; // sprint equipment-merge algorithm unsigned char end result = packageMergeSortedInPlace(maxLength, numNonZero, sorted); // "unsort" code lengths for (i = 0; i < numNonZero; i++) codeLengths[mapping[i].value] = sorted[i]; // let it scoot ... free(sorted); free(mapping); return end result; }

In a single among the closing sentences I talked about the Kraft-McMillan inequality (each and every so again and again McMillan is forgotten).

It is core message is: if sum(2^-codeLength[0...numCodes]) then a prefix-free code exists.


   Ideally you are trying to bear sum(2^-codeLength[0...numCodes]) = 1 ("equal" in establish of dwelling of "much less-than or "equal") because the rest is inefficient.



   The Huffman algorithm repeatedly presents you sum = 1. The Kit-Merge algorithm has the same property.
   The following dimension-limiting methods alter code lengths within the kind of capacity that the utmost code dimension is enforced ("dinky") while the inequality calm holds upright.
   Clutch that you ran the Huffman algorithm and situated out that a pair of codes are too prolonged. You know that sum = 1.



   A truly powerful insight is that making one among such prolonged codes finest one bit shorter increments the sum, thus violating the inequality: 

newSum = sum - 2^-oldLength + 2^-newLength = sum - 2^-oldLength + 2^(oldLength-1) = sum + 2^-oldLength
   Nonetheless, if there is any other image with a code dimension that is shorter than your required maxLength we can invent it one bit longer to offset our replace.



   The formula is virtually the same (but plus one in establish of dwelling of minus one within the exponent).



   Somewhat mighty all algorithm talked about within the follwing chapters pair prolonged with short codes such that the fresh dimension restrict is respected and the inequality holds upright.
JPEG's Length-Limiting Intention
   This is the shortest algorithm. It might probably be found out in Annex Okay.3 of the JPEG regular ITU T.81 alongside with a nice drift plan.



   You might perhaps perhaps additionally score that doc from w3.org, look web remark 147 of that PDF.
   First you produce a worn Huffman code. If the utmost code dimension doesn't exceed your restrict then you may perhaps perchance perhaps very successfully be obviously done.
   If you happen to glimpse at all codes having the longest code dimension then there is repeatedly an even quantity of them



   (if that wasn't the case then one image may perhaps perchance perhaps presumably be made one bit shorter because its closing bit is unused → an atypical quantity must be an error to your Huffman implementation).
   You repeat this direction of till reaching maxLength:

select two symbols x and y having that longest bit dimension
the canonical Huffman code of x and y might be virtually same: x ends with 0 and y ends with 1
let's call their shared prefix P
we resolve x's trailing 0, in remark that x becomes P, being one bit shorter
on the opposite hand, y is invalid now
let's select a third image z which is shorter than the fresh x (which implies at the least two bits shorter than the calm invalid y)
appending a 0 to z makes it one bit longer
appending a 1 to z creates an absolutely fresh code that might be old style for y

   What did we pause ?

we made 1 image shorter by exactly  1 bit (x)
we made 1 image shorter by at the least 1 bit (y)
we made 1 image longer  by exactly  1 bit (z)

   Why does it work ?



from a mathematical point of glimpse the sum of all Kraft values must be
    
we know that bits(x) = bits(y)
if bits(z) = bits(x) - 2 then sum(Kraft) = sum(Kraft')

because bits(x') = bits(x) - 1
and bits(y') = bits(y) - 1
and bits(z') = bits(z) + 1
in remark that bits(z') = bits(x') = bits(y')

let's ignore all symbols with the exception of x, y and z in remark that we've: 



        sum(Okay)  =     Okay(x)   +     Okay(y)   +     Okay(z)



                = 2^-bits(x) + 2^-bits(y) + 2^-bits(z)



                = 2^-bits(x) + 2^-bits(x) + 2^-(bits(x) - 2)



                =       2 2^-bits(x)    + 4 -bits(x)



                = 6 2^-bits(x)



        sum(Okay') =     Okay(x')  +     Okay(y')  +     Okay(z')



                = 3 Okay(x')



                = 3 2^-bits(x')



                = 3 2^(-bits(x) - 1)



                = 6 2^-bits(x)



                = sum(Okay)
therefore the sum of all Kraft values stays unchanged after the transformation

   Every step reduces the bit lengths of upright two symbols by generally one bit.



   Sadly, processing a immense alphabet with very grand bit lengths might be somewhat dull.



   Nonetheless, this allege may perhaps be very unlikely with the little JPEG alphabet (generally 162 symbols with an higher code dimension restrict of 16).
   The code simplifes and becomes sooner if work change accurate into a histogram of code lengths in establish of dwelling of the code lengths themselves.



   That capacity you sprint a single stream over all code lengths and count what number of symbols bear one bits, two bits, three bits, etc.
   The internal loop of my characteristic limitedJpegInPlace is surprisingly shprt:
   


     

      veil

     

     limitedJpegInPlace

    

    

     unsigned char limitedJpegInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])

     {

       if (newMaxLength <= 1)

         return 0;

       if (newMaxLength >  oldMaxLength)

         return 0;

       if (newMaxLength == oldMaxLength)

         return newMaxLength;

     

       // iterate over all "excessive" bit lengths, starting with the longest

       unsigned char i = oldMaxLength;

       while (i > newMaxLength)

       {

         // no codes at this bit dimension ?

         if (histNumBits[i] == 0)

         {

           i--;

           continue;

         }

     

         // glimpse codes that are at the least two bits shorter

         unsigned char j = i - 2;

         while (j > 0 && histNumBits[j] == 0)

           j--;

     

         // replace bit dimension of 2 of the longest codes

         histNumBits[i] -= 2;

         // one code becomes one bit shorter

         // (using the joint prefix of the celebrated two codes)

         histNumBits[i - 1]++;

     

         // the opposite code has dimension j+1 now

         // but any other, not-yet-eager, code with dimension j

         // is moved to bit dimension j+1 as successfully

         histNumBits[j + 1] += 2;

         histNumBits[j]--;

       }

     

       // return longest code dimension that is calm old style

       while (i > 0 && histNumBits[i] == 0)

         i--;

     

       // JPEG Annex Okay.3 specifies an further line: 

       // histNumBits[i]--;

       // => because JPEG desires a somewhat a pair of image to withhold a long way flung from 0xFF in its output stream

     

       return i;

     }

    
   
MiniZ's Length-Limiting Algorithm
   The DEFLATE file format imposes a 16 bit restrict in its Huffman stage.



   Unlike JPEG there isn't the kind of thing as a legit dimension-limiting direction of in remark that programmers came up with somewhat a pair of (but somehow very same) code.



   If found out the most simply identifiable code develop into as soon as MiniZ's huffman_enforce_max_code_size() (written by Rich Geldreich).

gzip and its zlib are more standard but the capacity they restrict code lengths is considerably hidden and interweaved with other functionality.



   As a long way as I realize each and every MiniZ's and zlib's dimension-limiting behaves identically.
   While MiniZ's regular opinion is reminiscent of JPEG's (for hundreds of enter they create same output), it is mighty sooner: 



   in establish of dwelling of slowly "eating" one bit away at a time from prolonged codes, it accurate away shortens all (!) prolonged codes to the fresh maxLength restrict.



   Now the Kraft-McMillan inequality may perhaps perchance perhaps also very successfully be violated in remark that a pair of codes must calm be mounted:

select a code x with maximum dimension (=maxLength)
select any other code y that is shorter, if that you may perhaps perchance perhaps presumably be also enlighten of maxLength-1 bits (even shorter codes work, too, but are inefficient)
invent y one bit longer
attach x the same dimension as the "fresh" y
we upright reduced the Kraft-McMillan sum by (at the least) 2^-maxLength
if it calm exceeds 1 then continue with step 1

   Point to: it is that you may perhaps perchance perhaps presumably be also enlighten of (and somewhat seemingly !) that the chosen code x will get assigned the same code dimension it had sooner than.



   For example, if newMaxLength is 16 and you may perhaps perchance perhaps simply bear codes with 15 bits, too.



   Now when you select a 16 bit code and select a 2nd 15 bit code then that 15 bit code becomes a 16 bit code and the 16 bit code stays a 15+1=16 bit code.
   The Kraft-McMillan computations might be performed with integers if we multiply everything by 2^maxLength: 

2^-maxLength 2^maxLength = 1, thus each and every iteration "saves" one unit and the total sum must be under 1 2^maxLength = 2^maxLength.
   If a histogram of code lengths is old style in establish of dwelling of code dimension themselves (upright like the JPEG algorithm) then the internal loop is upright 50 strains, and most of it are comments:
   


     

      veil

     

     limitedMinizInPlace

    

    

     unsigned char limitedMinizInPlace(unsigned char newMaxLength, unsigned char oldMaxLength, unsigned int histNumBits[])

     {

       if (newMaxLength <= 1)

         return 0;

       if (newMaxLength >  oldMaxLength)

         return 0;

       if (newMaxLength == oldMaxLength)

         return newMaxLength;

     

       // my allround variable for assorted loops

       unsigned int i;

     

       // scoot all outsized code lengths to the longest allowed

       for (i = newMaxLength + 1; i <= oldMaxLength; i++)

       {

         histNumBits[newMaxLength] += histNumBits[i];

         histNumBits[i] = 0;

       }

     

       // compute Kraft sum

       // (using integer math: everything is multiplied by 2^newMaxLength)

       unsigned long long total = 0;

       for (i = newMaxLength; i > 0; i--)

         total += histNumBits[i] << (newMaxLength - i);

     

       // iterate till Kraft sum doesn't exceed 1 anymore

       unsigned prolonged prolonged one = 1ULL << newMaxLength;

       while (total > one)

       {

         // select a code with maximum dimension, this might be moved

         histNumBits[newMaxLength]--;

     

         // derive a 2nd code with shorter dimension

         for (i = newMaxLength - 1; i > 0; i--)

           if (histNumBits[i] > 0)

           {

             histNumBits[i]--;

             // prolong that shorter code by one bit

             // and repair the same code dimension to the chosen one

             histNumBits[i + 1] += 2;

     

             damage;

           }

     

         // transferring these codes reduced the Kraft sum

         total--;

       }

     

       return newMaxLength;

     }

    
   
BZip2's Length-Limiting Algorithm
   bzip2 is a extraordinary compression theory. One of its quite a bit of stages is a Huffman compression stage.



   As a replace of playing around with the Kraft-McMillan inequality it entirely depends on a hasty Huffman encoder and trial'n'error:

compute Huffman codes to your alphabet
if maximum code dimension doesn't exceed maxLength then you may perhaps perchance perhaps very successfully be done
otherwise lower each and every image's count (but withhold a long way flung from turning into zero)
and return to step 1

   There are many potentialities the style to put in force step 3. bzip2 divides by two and clears the lower 8 bits of every and every counter (and provides 1 to withhold a long way flung from zeros).



   While being very hasty, it would also simply create sub-optimal Huffman codes. Now and again dividing by 3 presents higher results but that entirely depends to your enter files.
   The core loop is: 


     

      veil

     

     bzip2

    

    

     // sprint Moffat algorithm ...

     unsigned char end result = moffatSortedInPlace(numNonZero, sorted);

     // ... till a lawful maximum code dimension is found out

     while (end result > maxLength)

     {

       // roughly divide each and every weight by two while warding off zero

       for (i = 0; i < numNonZero; i++)

       {

         unsigned int weight = mapping[i].key;

     

         // bzip2 "clears" the lowest 8 bits of the histogram

         // to reach the length limit in less iterations

         // but sacrifices lots of efficiency

         // if you set EXTRA_SHIFT to 0 then the code may need more iterations

         // but finds much better code lengths

         //#define EXTRA_SHIFT 8

         #define EXTRA_SHIFT 0

     

         // sometimes dividing the weight by a bigger integer (e.g. 3)

         // may lead to more efficient prefix codes

         #define DIVIDE_BY   2

     

         // the shifts do more harm than good in my opinion

         weight >>= EXTRA_SHIFT;

     

         // including 1 avoids zero

         weight   = 1 + (weight / DIVIDE_BY);

         weight <<= EXTRA_SHIFT;

     

         // sorted develop into as soon as overwritten with code lengths

         mapping[i].key = sorted[i] = weight;

       }

     

       // again: sprint Moffat algorithm

       end result = moffatSortedInPlace(numNonZero, sorted);

     }

    
   
   The compression efficiency is in virtually all cases worse that Kit-Merge/JPEG/MiniZ. Moreover, it is slower than JPEG/MiniZ so there isn't the kind of thing as a real motive selecting bzip2's capacity.
Kraft-McMillan Approximations
   The aim of every and every presented dimension-limiting algorithm is producing a code that is successfully matched to Huffman decoders.



   Rather then Kit-Merge, all of them must generate a straightforward Huffman code and then alter overlong symbols.
   I already talked about the Kraft-McMillan inequality which states that every and every code where sum(2^-codeLength[0...numCodes])  is a prefix-free code.



   That capacity each and every such code might be decoded by Huffman decoders, too.

   I came all over a pair of postings, mainly by Charles Bloom, discussing how fully focussing on the Kraft-McMillan inequality might be each and every hasty and shut to optimal.



   Nonetheless, I made a pair of indispensable changes to walk up issues:

compute theoretical quantity of bits for every and every image: entropy = -log2(count / total)


using an ideal-hasty hasty approximation in establish of dwelling of C's log capabilities

round up or accurate down to closest integer, right here's our initial wager for the code dimension

but under no circumstances round to zero
if above our dimension restrict, then establish of dwelling it to that restrict

compute Kraft-McMillan sum of all symbols' code lengths
   Now we've a rough estimation. If the Kraft-McMillan sum doesn't exceed 1 then right here's truly a lawful prefix-free code (but presumably it is sub-optimal).



   If that sum is above 1 then we must alter (=amplify) a pair of code lengths. The principle opinion is editing these code lengths which "damage the least".



   When we rounded to the closest integer, many symbols rounded down: their proper code dimension is smaller than their theoretical bit dimension.
   Clutch you may perhaps perchance perhaps even bear a image which represents 9% of all symbols. Then its entropy is -log2(0.09) ≈ 3.474 → rounded to 3. There is a "own" of 3.474 - 3 = 0.474.



   "Fabricate" capacity that we old style 0.474 much less bits per image occurrence than the theoretical minimal, we saved some bits.



   If we replace that image's code dimension to 4 (assuming it is allowed by the dimensions restrict), then there might be a "loss" of 3.474 - 4 = -0.526.



   "Loss" capacity that we old style 0.526 much less bits per image occurrence than the theoretical minimal, we wasted some bits.



   The Kraft-McMillan sum might be lower though, optimistically now under 1, at the worth of rather worse compression.
   Shall we embrace there is any other image representing 10% of all symbols, thus being a diminutive bit more frequent than the previous image.



   It is entropy is -log2(0.1) ≈ 3.322, rounded to 3 as successfully. Attributable to this reality it is "own" is 3.322 - 3 = 0.322 and doubtless "loss" after including one bit is 3.322 - 4 = -0.678.



   Both symbols bear an designate on the Kraft-McMillan sum within the same capacity, but the latter image will motive even worse compression.
   You bear perchance observed that including one bit to image which had a own of x results in a loss of 1-x.



   So the most most considerable opinion is: repeatedly glimpse symbols with maximum own to motive minimal loss after including rather.
   I utilized two somewhat a pair of suggestions to change the code lengths, known as A and B.



   And inch, I obtained uninterested within the prolonged title "Kraft-McMillan inequality" and upright known because it "Kraft" (which is the good German word for "energy"/"energy"/"force").
Kraft - Formula A
   This algorithm works with quite a bit of passes. A threshold, within the origin a little payment, decreases after each and every stream.



   Walking thru all symbols, we glimpse symbols where the adaptation between theoretical bit dimension and proper bit dimension is above that threshold ("excessive own").



   One bit might be added and the Kraft sum adjusted. As soon as the Kraft sum isn't elevated than 1 anymore we've a legit prefix-free code.
   It is that you may perhaps perchance perhaps presumably be also enlighten of to "overshoot" by extending too many previously very short codes - which bear a excessive affect on the Kraft sum.



   If we are successfully under a Kraft sum of 1 then a single stream looks for codes that might be made one bit shorter without violating the Kraft sum.



   In many cases this closing stream isn't wanted and might be skipped.
   The algorithm works quite successfully. Nonetheless, the quantity of passes massively depends on the enter files.



   If the initial threshold is fair too excessive and/or shrinks finest a diminutive after each and every stream then an ideal quantity of passes might be required.



   If the initial threshold is fair too low  and/or shrinks loads after each and every stream then finest few passes might be required.



   While the latter sounds considerably higher, this can scoot away out many optimal code lengths and create sub-optimal results.



   I did not derive a supreme one-suits-all configuration. The default INITIAL_THRESHOLD = 28 / 64.0 and STEP_THRESHOLD = 1 / 64.0 worked successfully in my experiments.
Kraft - Formula B
   Formula A makes a pair of guesses in the case of the threshold. Moreover, it again and again scans symbols which have to not be modified.



   My approach B implements a max-queue. Its first ingredient is repeatedly the image with the absolute top own.



   Attributable to this reality we don't prefer scan all symbols but repeatedly select from the head of that max-queue.



   After a image develop into as soon as adjusted, there is a must re-insert it into the max-queue.
   One of many most atmosphere friendly files buildings for such score admission to sample is a heap. Most regular libraries advance with a heap implementation (reminiscent of C++).



   Roughly half of the code of limitedkraftheap.c is a pretty generic and straightforward heap implementation.
   Corresponding to approach A, we would also "overshoot" in remark that a hasty repair-up exists as successfully.
   There may perhaps be not any vital incompatibility between approach A and B referring to the generated code lengths.



   Nonetheless, approach B is mighty sooner (generally by a element of 4) and even outperformes an atmosphere friendly Huffman implementation.



   It is attention-grabbing to pronounce that the dimensions restrict has a truly indispensable impact on the execution time: lower limits are sooner.
License
   This code is licensed under the zlib License: 


This utility is geared up 'as-is', with none remark or implied

warranty. In no tournament will the authors be held liable for any damages

increasing from the utilization of this utility.
Permission is granted to somebody to make employ of this utility for any purpose,

including industrial applications, and to alter it and redistribute it

freely, field to the following restrictions:
1. The starting establish of this utility must never be misrepresented; you want not

   claim that you wrote the customary utility. If you happen to make employ of this utility

   in a product, an acknowledgment within the product documentation may perhaps perchance perhaps presumably be

   liked but isn't required.

2. Altered source variations must be it appears that marked as such, and must never be

   misrepresented as being the customary utility.

3. This inspect may perhaps perchance perhaps also simply not be removed or altered from any source distribution.zlib License
    
Offer Code
   Git customers: scroll down to the repository link
   Most up-to-date liberate:  February 5, 2021, dimension: 1668 bytes, 30 strains
CRC32: 9f50f131

MD5:   9c4358817e4f951186dc6c87866ab727

SHA1:  5533b6f2fa58c109c590ae4f33665ac7d3e4b896

SHA256: beb1da8cdce260cb540ab871c5f98d46b596e3956d61324fd1567739492384f8
   Most up-to-date liberate:  June 29, 2021, dimension: 11.0 kBytes, 338 strains
CRC32: b70bd1e5

MD5:   e826eda6c202bfa2fa22daf9e3a5a246

SHA1:  fc47f5ef18d5c82996032b3d3d9c83a85df8f940

SHA256: 2dec5ba9d228eeb44bab2f90312518f546475a65982566857bc8ffe402c42a34
   Most up-to-date liberate:  February 5, 2021, dimension: 2208 bytes, 41 strains
CRC32: 979b7a5c

MD5:   264867286a2a10419e77ad7e2da9ee69

SHA1:  2cebbac1ae3c7f53ee7562ed73b9e68ca4987dff

SHA256: 04e45e82ad87c9bb938d7e68b9a39346dbe6d1c8ff9e5859894be43ec62ba406
   Most up-to-date liberate:  February 10, 2021, dimension: 4436 bytes, 175 strains
CRC32: a5935e36

MD5:   db5f8486daebbbe7179c2ed964bd7a80

SHA1:  b688a68b47c8ef654e3271a78152a5382c480820

SHA256: 1ec47ed8df361a5cb8ef847b94dc1be85fccfaaab2c08d7d56f4dd551d8b31eb
   Most up-to-date liberate:  February 5, 2021, dimension: 4835 bytes, 80 strains
CRC32: 161095e9

MD5:   715a76cfe7bb7d74e8c1e5f2fbcfb4d9

SHA1:  e0f8bd3a6b3afcd6a793fa8daab8b7b34db92cf4

SHA256: e75a18a20fe66cee275a0d0ebe52cca3bf9ad14c92994c3a11d5a68651d98819
   Most up-to-date liberate:  February 10, 2021, dimension: 12.2 kBytes, 356 strains
CRC32: c503747c

MD5:   e878141ab902742613f9863e97118ec0

SHA1:  e6a180b59caf39df790cc047876392c65c114232

SHA256: 0e85a8d136f07214b4203a4036a75a3f817e64f1edebf7028b3df55217924ed6
   Most up-to-date liberate:  February 5, 2021, dimension: 1476 bytes, 29 strains
CRC32: c958c820

MD5:   5735c2438ffbd21540c640e48a8217fd

SHA1:  e157bb74cc746609ae083ceb497bf4327f31f016

SHA256: 39a411f02183cc34a8559f973724517a6d81c68c5259478f0a05fdb8fb66658a
   Most up-to-date liberate:  February 10, 2021, dimension: 4050 bytes, 131 strains
CRC32: 821a4f0f

MD5:   af0a84f35b32310ec33ad6da058ff3e4

SHA1:  a1eff36323c9917f62f0cfd1ee000388c44a4c1e

SHA256: a8bfb557d645a1419df9fe071682d9be88af71c82905bddfb0f3d10e50964a78
   Most up-to-date liberate:  February 5, 2021, dimension: 745 bytes, 17 strains
CRC32: cb6d4745

MD5:   9f5615706645e6d591769cd9385c90bd

SHA1:  e8e526e2e32b16197ec77c593f476230221217a6

SHA256: 8ee50def030ba1dd2383febc970496c1f7b9cfe824774d133b5f75cc0741b641
   Most up-to-date liberate:  February 5, 2021, dimension: 6.9 kBytes, 224 strains
CRC32: fddbc6e9

MD5:   a210caef2d00f1be7c9da18b193df501

SHA1:  648b90061f069d74beb41decdcd7ed28c4eaffef

SHA256: 545e870024999e917bd52347419872458178f61645853e5360bf04f35e2aa63c
   Most up-to-date liberate:  February 5, 2021, dimension: 753 bytes, 17 strains
CRC32: 7a4c7aa3

MD5:   f920b31d7c745895287ae07e357b0737

SHA1:  7bf007989f4ee0ec9e6af857038fa52bf7a5b59f

SHA256: c12082e6be08b9028d80ad2f148ecfb780ea4daf9ee01b9569e7f30553dba489
   Most up-to-date liberate:  February 5, 2021, dimension: 9.7 kBytes, 342 strains
CRC32: 0a9bb4f3

MD5:   1d5fcb52464eb0ae0922ec43dddf862f

SHA1:  99e13c8494a78e5c4d602abb124da76e2064d049

SHA256: 3a4559d076c85703d1936b4ee09a9d1b2fbc71ab73ddf20ee9805ab29420fc56
Bring together up-to-date: git clone https://produce.stephan-brumme.com/dimension-dinky-prefix-codes/git
   If you happen to advance all over any bugs/considerations or bear suggestions for boosting future variations, please write me an e-mail:

   [email protected]
Changelog

version 1

June 30, 2021
initial liberate

Be taught Extra

(array index)

Code Length (initially)

Code Length

(after processing bottom row)