Posts

Showing posts from August, 2013

tzip - file compressor

Image
tzip  is (probably) the simplest text-file compressor. It compresses a file by counting the number of different bytes and map those bytes to shorter sequences of bits. Each byte is composed from 8 bits. A byte value can range from 0 to 255 (inclusive), which means that 8 bits can be mixed up to represent only 256 different bit-sequences. In a text file, however, not all this bytes are used, and for a lower number of patterns, we can use less bits per pattern. For example, if we take the word ' youtube ', it is composed from 7 bytes, but has only 6 different bytes: 'y' , 'o' , 'u' , 't' , 'b' , 'e' To represent this bytes, we will need sequences of only three bits. To calculate exactly how many bits are required in a sequence for a given number of different bytes ( n ) we have the following formulas:      ceil ( log ( n ) / log ( 2 ) ) ; or: p = next_power_of_two ( n ) ; log