I have updated my github project with support for arithmetic coding. It uses the algorithm provided by Malte Clasen and Eric Bodden. It is an integer based encoder (32 bit unsigned).
I have made some changes to the original implementation to separate the statistical models more fully from the coder. This allows substituting models on a per symbol basis.
An example of this behavior is provided in the ArithmeticStream class (paralleling the compression classes in System.IO.Compression). This class uses two models: a zero order model and a new symbol model. The former is only initialized with two symbols (stream terminator and new character). The latter is initialized with all characters.
When a new symbol is encountered to compress the following steps occur:
- The NewCharacter symbol from the zero order model is emitted.
- The stats for the zero order model are updated.
- The symbol is emitted from the new symbol model.
- The model is updated to zero the weight of the symbol (since it will no longer go through this model again).
- The zero order model is updated with the symbol. Future processing of this symbol will now go through the zero order model.
for (int i = 0; i < count; ++i) { byte b = buffer[offset + i]; if (_newCharacterModel.Emitted(b)) { _coder.Encode(b, _adaptiveModel, _stream.Write); } else { _coder.Encode(ZeroOrderAdaptiveByteModel.NewCharacter, _adaptiveModel, _stream.Write); _coder.Encode(b, _newCharacterModel, _stream.Write); _adaptiveModel.Update(b); } }
/// Encodes <paramref name="symbol"/> using the <paramref name="coder"/> with /// the provided <paramref name="model"/>. /// </summary> public static void Encode<TSymbolType>(this ArithmeticCoder coder, TSymbolType symbol, IModel<TSymbolType> model, WriteBitDelegate bitWriter) where TSymbolType : struct { // cumulate frequencies Range count = model.GetRange(symbol); // encode symbol coder.Encode(count, model.TotalFrequencies, bitWriter); // update model model.Update(symbol); }
I still have a number of unit tests to write. I also want to see whether the lost range from the integer division makes a substantial difference to compression and how much work is involved to address it. Finally a C port is in the queue to do.