How to Decompress and Compress .DAT Files Used in PC Lemmings
document by ccexplore
small fix by Simon (in XgagrX.dat, terrain comes first, objects second)

Decompression
-------------

All compressed .dat files use the same compression algorithm.  .dat files
which are compressed include the following:

XgagrX.dat
levelXXX.dat
XgaspecX.dat
Main.dat
Cgamain.dat

I don't know about the other .dat files; however, importantly enough, the
groundXo.dat files are NOT compressed.

The XgaspecX.dat files have a second layer of compression underneath the
.dat compression scheme (that is, once you decompress an XgaspecX.dat file
using the normal .DAT decompression algorithm, you have to apply yet
another decompression algorithm to get to the actual data); I'll explain
that in a separate document.

A compressed .dat file consists of some number of "compressed data
sections" one following another.  Each compressed data section consists of
a 10-byte header and then the actual compressed data, and immediately after
that comes the next compressed data section with its header and data, and
so forth.

A header has the following layout:

byte num_bits_in_first_byte
byte checksum
word unused1
word decompressed_data_size
word unused2
word compressed_data_size

I'll explain num_bits_in_first_byte later.  The checksum is an exclusive or
("xor") of all the bytes in the actual compressed data (ie. excluding the
header).  The decompressed_data_size refers to the number of bytes you'll
get when you decompress the data.  The compressed_data_size refers to the
SUM of the number of bytes in the header (always 10) plus the number of
bytes in the actual compressed data.  unused1 and unused2 are not used and
are set to zero.

---------------------------------------------

In explaining the decompression algorithm, for now imagine that instead of
bytes, the compressed data (minus the header) is actually a stream of bits.
I'll explain later how this conceptual stream of bits is stored as the
bytes found in the file.

The decompression algorithm builds the decompressed data backwards,
starting from its last byte going byte-by-byte towards the first byte.
Compression is achieved because the stream of bits can encode both raw
bytes, and references to a block of consecutive bytes that had been
decompressed previously.

Specifically there are 6 different ways the bits encode bytes of the
decompressed data:

1) 00 nnn xxxxxxxx xxxxxxxx ...

Two consecutive 0 bits encode the start of a short chunk of raw bytes.  The
next 3 bits encode the number of bytes in the encoding, minus 1.  Then the
following groups of 8 bits each encode a byte.  Remember that decompression
goes backwards byte by byte, so the last xxxxxxxx gives you the earliest
byte.

example: 00 000 00010000

This decompresses to the byte 00010000 = 0x10 (hex) or 16 (decimal).  The
00 means a short chunk of raw bytes.  The 000 means the number of bytes
encoded is 1 since 1 - 1 = 0.  Then 0x10 comes from the next 8 bits which
were 00010000.

example: 00 001 00010000 11111111

This decompresses to the bytes 0xFF 0x10.  We know 2 bytes are encoded
because of the 001.  The 0xFF comes from 11111111 and the 0x10 from
00010000.  Notice that although the bits that encode 0xFF come later,
because decompression reconstructs the bytes backwards, the 0xFF goes
before the 0x10 in the decompressed data.

example: 00 100 11111111 10101010 00001111 11110000 00000000

This decompresses to 0x00 0xF0 0x0F 0xAA 0xFF.

Because only 3 bits are used to encode the number of bytes, only a maximum
of 8 bytes can be encoded in this format.
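To make this concrete, here is a rough C++ sketch of decoding this format.
It assumes a helper get_bits(n) that returns the next n bits of the
conceptual bitstream (building that stream from the file's bytes is
described near the end of this section), composing multi-bit fields with
the first bit pulled as the most significant, which matches the notation
used above.  The names get_bits, decode_short_raw, out and pos are
illustrative choices for this sketch, not from any particular
implementation:

    unsigned get_bits(int n);  // assumed helper; see the end of this section

    // out is the buffer for the decompressed data; pos starts at
    // decompressed_data_size and moves towards 0, since the data is
    // reconstructed back-to-front.
    void decode_short_raw(unsigned char* out, size_t& pos)
    {
        int n = get_bits(3) + 1;    // the 3-bit field holds count - 1
        while (n--) {
            --pos;                  // earlier bits fill later bytes
            out[pos] = (unsigned char)get_bits(8);
        }
    }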
2) 01 mmmmmmmm

A "01" encodes a reference to two consecutive bytes in previously
decompressed data.  The mmmmmmmm denotes the offset, minus 1, to the END of
the two-byte block from the current position.

For the following examples, suppose so far we have decompressed the
following bytes

0x01 0x02 0x03 0x09 0x07

and then we encounter the 01 mmmmmmmm in the stream of bits.  This will
decode into two bytes, and since decompression reconstructs the data
backwards, these next two bytes will go BEFORE the 0x01.

example: 01 00000001

The 00000001 means offset - 1 = 1, so offset = 1 + 1 = 2.  So the 2 bytes
referred to are "0x01 0x02".  So after decoding these bits the decompressed
data becomes

0x01 0x02 0x01 0x02 0x03 0x09 0x07

example: 01 00000100

00000100 is 4, so offset = 4 + 1 = 5.  So we are referring to the
"0x09 0x07", and so after decoding we get

0x09 0x07 0x01 0x02 0x03 0x09 0x07

example: 01 00000000

This might look like an illegal encoding, but is actually valid and
meaningful!  The offset is 1.  What does that mean?  Well, it decodes into
0x01 0x01, making the decompressed data become

0x01 0x01 0x01 0x02 0x03 0x09 0x07

If you think about it this makes sense because in the final decompressed
data we in fact DO have two occurrences of the "0x01 0x01" block; it just
so happens that because the offset is 1, these two blocks overlap:

 _______
|       |
0x01 0x01 0x01 0x02 0x03 0x09 0x07
     |_______|

I'll provide similar examples later in other encodings, so hopefully you'll
understand this crucial final point by then.  It is crucial because as
you'll see later, this concept allows for an efficient way to encode a long
run of the same byte value.

Notice that since only 8 bits are allotted for the offset - 1 value, the
maximum offset that can be encoded in this format is 256, corresponding to
11111111.

3) 100 mmmmmmmmm

This is similar to #2.  It is also a reference, but to three consecutive
bytes in previously decompressed data.  The offset field is now 9 bits
instead of 8, thereby allowing you to reference as far as 512 bytes off
from the current position.

Again as before, we'll assume we have so far decompressed and gotten the
bytes

0x01 0x02 0x03 0x09 0x07

example: 100 000000011

So 000000011 is 3, so the offset is 4, and therefore the block of 3 bytes
referred to is "0x02 0x03 0x09".  The decompressed data then becomes
0x02 0x03 0x09 0x01 0x02 0x03 0x09 0x07.  Here is a diagram depiction:

 ____________
|            |
0x02 0x03 0x09 0x01 0x02 0x03 0x09 0x07
                    |____________|

example: 100 000000001

The offset is 2, so again we are talking about two overlapping blocks of 3
bytes in the original decompressed data.  So the block of bytes generated
by the encoding is 0x02 0x01 0x02, and the decompressed data then becomes:

 ____________
|            |
0x02 0x01 0x02 0x01 0x02 0x03 0x09 0x07
          |____________|

Again, I marked the two 3-byte blocks so you can clearly see how the block
of 3 that was generated from decoding the bits is indeed a copy of the
other block of 3 that is an offset of 2 away.

example: 100 000000000

Hopefully by now it should make sense to you that the bits decode into the
3 bytes 0x01 0x01 0x01, and the decompressed data becomes
0x01 0x01 0x01 0x01 0x02 0x03 0x09 0x07.  But just in case, here's the
diagram depiction:

 ____________
|            |
0x01 0x01 0x01 0x01 0x02 0x03 0x09 0x07
     |____________|

Another way to work out the bytes the bits decode to in the situations with
overlapping blocks is to consider copying the bytes from the referenced
block to the destination starting from the last byte back to the first.  So
in the second to last example above, first we copy the 0x02:

            _________
           |         |
           v         |
____ ____ 0x02 0x01 0x02 0x03 0x09 0x07

Then the 0x01:

       _________
      |         |
      v         |
____ 0x01 0x02 0x01 0x02 0x03 0x09 0x07

And finally, the 0x02:

  _________
 |         |
 v         |
0x02 0x01 0x02 0x01 0x02 0x03 0x09 0x07
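In code, this back-to-front copy is only a few lines.  Here is a rough
sketch using the same out/pos convention as the earlier fragment (the
helper name copy_ref is again just an illustrative choice); copying the
last byte first is exactly what makes the overlapping cases come out right:

    // Decode a reference: copy len bytes from the block that ENDS at an
    // offset of m from the current position.  Works for format #2 (len 2,
    // m from the 8-bit offset field) and #3 (len 3, 9-bit offset field),
    // as well as formats #4 and #5 described below.
    void copy_ref(unsigned char* out, size_t& pos, int len, int m)
    {
        while (len--) {
            --pos;                    // next byte to write, moving backwards
            out[pos] = out[pos + m];  // pos + m lies in already-written data
        }
    }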
4) 101 mmmmmmmmmm

This is similar to #2 and #3.  It is a reference to four consecutive bytes
in previously decompressed data.  With a 10-bit offset field, you can
reference as far as 1024 bytes off from the current position.

Again, assume we've so far decompressed and gotten

0x01 0x02 0x03 0x09 0x07

example: 101 0000000100

0000000100 is 4, so offset is 5, and the bits decode to the bytes
"0x02 0x03 0x09 0x07".  The decompressed data becomes
0x02 0x03 0x09 0x07 0x01 0x02 0x03 0x09 0x07.

example: 101 0000000010

Offset is 3, and the bits decode to the bytes "0x03 0x01 0x02 0x03".
Decompressed data becomes

0x03 0x01 0x02 0x03 0x01 0x02 0x03 0x09 0x07

5) 110 nnnnnnnn mmmmmmmmmmmm

This is similar to #2, #3 and #4; in fact it's essentially a
generalization, allowing you to specify both the length of the block of
bytes you are referencing as well as its location (offset).  The 8 bits
nnnnnnnn following the 110 specify the length of the block, minus 1,
allowing for a maximum block length of 256.  The mmmmmmmmmmmm is a 12-bit
offset field that lets you reference as far as 4096 bytes off from the
current position.

Although you can use this format to encode references to blocks of length
2, 3 or 4, the previous encoding formats are obviously more efficient for
those purposes, assuming their offset field is large enough to hold the
offset in question.  For length 1 this encoding takes far more bits than if
you just encode the byte directly using the "00 nnn xxxxxxxx ..." format.
So this encoding is primarily for encoding references to blocks of length 5
or higher, or occasionally to a block whose offset is too far away to be
captured in the more efficient encodings.

Here for our examples below, assume the decompressed data so far is

0x0A 0x02 0x03 0x04 0x05 0x06 0x07 0x08

example: 110 00000100 000000000110

nnnnnnnn here is 00000100 = 4, so the length of the block is 5.  The offset
field is 000000000110 so the offset is 6 + 1 = 7.  So the reference is to a
block of 5 bytes an offset of 7 away from the current position, so the bits
decode into "0x03 0x04 0x05 0x06 0x07", and the decompressed data now
becomes 0x03 0x04 0x05 0x06 0x07 0x0A 0x02 0x03 0x04 0x05 0x06 0x07 0x08.

example: 110 00001111 000000000000

So the block length is 16 and the offset is 1.  This is an example of how
you can encode a long run of the same byte.  The bits decode to 16 "0x0A"s,
and so the decompressed data becomes a run of 17 "0x0A"s (including the one
that was already decompressed earlier), followed by the previously
decompressed "0x02 0x03 0x04 0x05 0x06 0x07 0x08".

6) 111 nnnnnnnn xxxxxxxx xxxxxxxx ...

Finally, this is similar to #1.  It encodes raw bytes, but allows you to
encode more than the maximum of 8 possible with the "short raw data" format
that is #1.  The 8-bit value nnnnnnnn is the number of bytes encoded minus
9, so you can encode up to 255 + 9 = 264 raw bytes with this encoding.
Again, because decompression always builds the data from back to front, the
first xxxxxxxx encodes a byte closest to the end of the decompressed data
while the last xxxxxxxx encodes a byte closest to the beginning.

example: 111 00000000 00000001 00000010 00000011 00000100 00000101
             00000110 00000111 00001000 00001001

This decodes to "0x09 0x08 0x07 0x06 0x05 0x04 0x03 0x02 0x01".  We know
there are 9 bytes from the "00000000".  And again notice the backwardness
that causes the first 8 bits following the "00000000" to actually encode
the 0x01 that goes to the end, while the last 8 bits in the encoding yield
the 0x09 that goes to the beginning.
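Putting all six formats together, the heart of the decompressor is a single
dispatch loop on the leading bits.  Here is a rough sketch, with get_bits
and copy_ref as in the earlier fragments (again illustrative code, not any
particular implementation):

    // pos starts at decompressed_data_size; the loop ends once the whole
    // buffer has been filled back-to-front (see below on when to stop).
    while (pos > 0) {
        if (get_bits(1) == 0) {
            if (get_bits(1) == 0) {               // 00: short raw chunk
                int n = get_bits(3) + 1;
                while (n--) { --pos; out[pos] = (unsigned char)get_bits(8); }
            } else {                              // 01: 2-byte reference
                copy_ref(out, pos, 2, get_bits(8) + 1);
            }
        } else switch (get_bits(2)) {
            case 0:                               // 100: 3-byte reference
                copy_ref(out, pos, 3, get_bits(9) + 1);
                break;
            case 1:                               // 101: 4-byte reference
                copy_ref(out, pos, 4, get_bits(10) + 1);
                break;
            case 2: {                             // 110: generalized reference
                int n = get_bits(8) + 1;          // length field comes first
                copy_ref(out, pos, n, get_bits(12) + 1);
                break;
            }
            case 3: {                             // 111: long raw chunk
                int n = get_bits(8) + 9;
                while (n--) { --pos; out[pos] = (unsigned char)get_bits(8); }
                break;
            }
        }
    }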
As a final exercise summing up all 6 encoding formats, work it out for
yourself that the stream of bits:

111 00000000 00000001 00000010 00000011 00000100 00000101 00000110
    00000111 00001000 00001001
101 0000000111
01 00000001
00 001 00000001 11111111
110 00000110 000000000010
100 000001110

decompresses to the following bytes:

0x04 0x03 0x02 0x05 0xFF 0x01 0x05 0xFF 0x01 0x05 0xFF 0x01 0x05 0x04
0x05 0x04 0x03 0x02 0x09 0x08 0x07 0x06 0x05 0x04 0x03 0x02 0x01

You determine when to stop decompressing by keeping track of how many bytes
you have decompressed so far.  Once you have decompressed as many bytes as
dictated by the value of decompressed_data_size in the header, you stop.

-----------------------

Finally, how to retrieve the stream of bits from the actual bytes of
compressed data.  It turns out the stream of bits is also stored in a
"backwards" fashion into the bytes of the compressed data, so that bits at
the front of the stream go into the last bytes of the compressed data, and
bits at the end of the stream go into the first bytes of the compressed
data.  Specifically, starting from the last byte of the compressed data,
you pull out the bits of that byte one-by-one starting from the rightmost
bit (ie. the "least significant" bit) until all 8 bits are pulled, then you
go to the next to last byte and repeat the process, and so forth until you
end up at the first byte of compressed data.

Actually, one more important detail: remember the num_bits_in_first_byte
field of the header?  The value of that field dictates a special treatment
for the first byte you generate the bitstream with (namely, the last byte
of compressed data): after you've pulled num_bits_in_first_byte many bits
from that byte, you skip to the previous byte, from which point on you will
always pull all 8 bits from a byte before moving on.  If
num_bits_in_first_byte happens to be zero (shouldn't happen, but if it
does), then this means you don't pull any bits from the last byte of
compressed data, but instead start off by directly skipping to the second
to last byte of compressed data.  Clearly num_bits_in_first_byte should
normally range from 1 to 8.  This value exists because the total number of
bits in the bitstream may not be a multiple of 8, in which case one byte
will only be holding 7 or fewer bits of the bitstream.
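Here is a rough sketch of a bit reader implementing the above; it is what
the get_bits helper assumed by the earlier fragments might look like (the
struct and its member names are again just illustrative):

    #include <cstddef>
    #include <cstdint>

    struct BitReader {
        const uint8_t* data;  // compressed data, without the 10-byte header
        size_t byte_pos;      // index of the byte bits are being pulled from
        int bits_left;        // unread bits remaining in that byte
        uint8_t cur;          // that byte, shifted as bits are consumed

        BitReader(const uint8_t* d, size_t size, int num_bits_in_first_byte)
            : data(d), byte_pos(size - 1),
              bits_left(num_bits_in_first_byte), cur(d[size - 1]) {}

        unsigned get_bits(int n) {
            unsigned v = 0;
            while (n--) {
                if (bits_left == 0) {      // skip to the preceding byte
                    cur = data[--byte_pos];
                    bits_left = 8;
                }
                v = (v << 1) | (cur & 1);  // pull the rightmost unread bit;
                cur >>= 1;                 // the first bit pulled ends up as
                --bits_left;               // the most significant bit of v
            }
            return v;
        }
    };

The examples below can be used to check such a reader bit-by-bit.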
Here are some examples, showing the compressed data, the value of
num_bits_in_first_byte, and the corresponding stream of bits you will do
your decompression from.  (Note: since the examples are meant only for
illustrating the compressed data -> bitstream process, the resulting stream
of bits might not actually work out correctly if you try to decompress from
it.)

Example 1:

compressed data: 0xA3
num_bits_in_first_byte = 8
corresponding stream of bits: 1 1 0 0 0 1 0 1

Explanation: 0xA3 is 10100011 in binary.  As explained earlier, you pull
the bits out from the right, so the first bit in the stream is the
rightmost bit of 0xA3, the second bit in the stream is the bit second from
right in 0xA3, etc.  Because num_bits_in_first_byte = 8, all 8 bits of the
byte contribute to the bitstream.

Example 2:

compressed data: 0x57 0xA3
num_bits_in_first_byte = 8
corresponding stream of bits: 1 1 0 0 0 1 0 1 1 1 1 0 1 0 1 0

Explanation: Remember, we start off with the last byte of compressed data
to start the bitstream.  And here, the last byte is 0xA3 as before, plus
num_bits_in_first_byte is still 8.  So the first 8 bits of the bitstream
are identical to what we had in example 1.  The next 8 bits of the
bitstream then come from the byte preceding 0xA3, namely 0x57 which is
01010111 in binary.  Since we pull bits from the right as before, the
9th-11th bits in the bitstream are 1, then the 12th bit is 0, etc.

Example 3:

compressed data: 0x57 0xA3
num_bits_in_first_byte = 6
corresponding stream of bits: 1 1 0 0 0 1 1 1 1 0 1 0 1 0

Explanation: The compressed data is exactly the same as example 2, but
num_bits_in_first_byte is now 6 instead of 8.  The effect is that after
pulling 6 bits from 0xA3, we ignore the remaining 2 bits, instead skipping
ahead to the preceding byte of 0x57 and pulling the 7th bit and onwards
from there.  Notice that num_bits_in_first_byte only applies to 0xA3 (the
last byte of compressed data, which is the FIRST byte used for constructing
the bitstream; hope the nomenclature isn't too confusing), so when we work
with 0x57, we still use all 8 of its bits, not just 6.

Because 2 bits are ignored, if we had 0x57 0x23 instead of 0x57 0xA3, the
resulting stream of bits would still be the same.  Whereas in example 2, if
we had 0x23 instead of 0xA3, the 8th bit of the corresponding bitstream
would've been a 0 instead of a 1.

============================================

Compression
-----------

In general, given a particular compressed file format, there is basically
just one decompression algorithm, but potentially infinitely many
compression algorithms that work (some better than others of course).  So
the discussion here is just a suggestion of one possible compression
algorithm that would work with the .DAT file format.  I've tested it though
and it works pretty well for .DAT files that are levelpacks (in fact,
almost on par with the compression algorithm used by Psygnosis itself, the
latter being a hair more compact, and definitely far surpassing the
compression algorithm lemedit uses).  It doesn't work too well for the
other .DAT files (resulting in filesizes 1.5 to 2 times bigger than what
Psygnosis managed).

It should be clear from the discussion of decompression that you achieve
compression through the use of those reference encodings, where you
reference a block of bytes instead of encoding its bits directly in the
bitstream, thereby reducing the number of bits used.  For example, 3 bytes
normally take 8x3 = 24 bits, but if you can use a "100" encoding instead,
the same 3 bytes would only take 12 bits.

Also observe that a reference to a block of n bytes can always be thought
of as n - 1 consecutive references to 2-byte blocks.  For example:

 _________________
|                 |
0x01 0x02 0x03 0x04 ...(some intervening bytes)... 0x01 0x02 0x03 0x04
                                                   |_________________|

can be thought of as 3 consecutive length-2 references:

      _______                                            _______
     |       |                                          |       |
0x01 0x02 0x03 0x04 ...(some intervening bytes)... 0x01 0x02 0x03 0x04
|_______| |~~~~~~~|                                |_______| |~~~~~~~|
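When deciding between encodings while compressing, it helps to have the
exact bit cost of each format on hand.  The following small helper (the
function names are invented for illustration, not taken from
compression.cpp) tallies the costs from the format descriptions above;
something like it is needed in step 5 of the approach below for sizing the
output:

    // Cost in bits of encoding a len-byte block as a reference, assuming
    // the offset fits in the format's offset field.
    int bits_for_reference(int len)       // len >= 2
    {
        if (len == 2) return 2 + 8;       // "01"  + 8-bit offset  = 10 bits
        if (len == 3) return 3 + 9;       // "100" + 9-bit offset  = 12 bits
        if (len == 4) return 3 + 10;      // "101" + 10-bit offset = 13 bits
        return 3 + 8 + 12;                // "110" + length + offset = 23 bits
    }

    // Cost in bits of encoding n bytes as one raw chunk (n <= 264).
    int bits_for_raw(int n)
    {
        return (n <= 8 ? 5 : 11) + 8 * n; // 5 = "00"+nnn, 11 = "111"+nnnnnnnn
    }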
0x01 0x02 0x03 0x04 |_______| |~~~~~~~| |_______| |~~~~~~~| So my approach to compression is basically as follows: 1) Scan the original uncompressed data backwards from end to beginning. 2) Use a table to keep track of occurrences of pairs of consecutive bytes. Use it to keep track of the most recent occurrence for each of the 65536 possible pairs of two consecutive bytes. (If you know what a hash table is, you can use that to avoid having a giant table with 65536 entries. Well, 65536's not really that big of a table, except maybe in DOS and even then you might still be okay.) 3) As you scan the uncompressed data, create a list of 2-byte references, based on the table you use to keep track of occurrences. For example, suppose your current two consecutive bytes are 0x01 0x02, and the table recorded the closest previous occurrence is 6 bytes away, then this means a 2-byte reference in the current position with offset = 6. Add that to the list of 2-byte references you've found so far. But what if you already have on your "reference list" a 2-byte reference that overlaps with the current 2-byte reference you are trying to add? (This can happen for example with the diagrammed example above.) Well, there are two cases: i) The reference you are trying to add and the other reference it overlaps with both have the same offsets. Then basically we're talking precisely about the situation depicted in the diagrammed example above, where the two 2-byte references are really a single 3-byte reference. So instead of adding a separate reference, increase the length of the reference already on the list by 1. ii) In the other case, the offsets are different. Then they really are separate 2-byte references, just so happens that they overlap. You can't have both though in the final compressed data, since each reference when decompressed generates exactly 2 bytes. So you'll need to make one of the two a 1-byte refernece instead. Except 1-byte references can only be represented with the 23 bit "110" format, which is far too inefficient. So in other words, you drop one of the two 2-byte references. This means you don't add the 2-byte reference you would've otherwise added to the list. 4) Then you scan through the list of references and makes adjustments, based on the fact that: i) the maximum offset of a reference is limited depending on its length ii) references of lengths > 256 must eventually be broken down into two or more references of length 256 or less There are many additional things you can consider here. For example, although a lone 2-byte reference allows you to store 16 bits as 10 bits in the compressed data, if the bytes to either side of the 2-byte destination block (the block you are copying bytes into) can only be encoded in one of the two "raw" formats, then in effect you are splitting what could've potentially been just one unbroken raw encoding into two. Since a raw encoding has overhead (5 or 11 bits depending on number of bytes encoded, which is in addition to the bits you need to encode the raw bytes themselves), the 6 bits you saved in the 2-byte reference might not be worth it. If you look at my C++ code, the method Compressor::optimize_refchunks deals with these sort of things. That being said, don't get too clever with this. (I'd say even my code is probably a bit overkill in this aspect.) 5) Then calculate the total number of bits you'd need to encode the references and the intervening raw-byte encodings. 
This is in a nutshell the approach I used.  For details see the code for
the class "Compressor" in compression.cpp.

Notice that you only have to keep track of potential 2-byte references,
since longer references come out naturally from step 3i); in particular,
since you always keep track of just the most recent occurrences, my
approach will automatically identify things like long runs of the same
byte.  Of course, the same choice of keeping track of just the most recent
occurrence means sometimes my approach might miss a long n-byte reference
with a large offset, because there was an intermediate short reference with
a smaller offset which caused the table to lose track of the earlier
occurrence that would've allowed a longer reference to be used.

Finally, if you find yourself too confused, consider not doing any
compression; instead simply encode the data using raw-byte encodings only.
The resulting "compressed" file would of course be bigger than the original
decompressed data, but it'll make programming a lot easier.  The main
reason compression was used at all was so the game would fit on a floppy
back then; nowadays you have more than enough memory and disk space to deal
with the data without compressing it, and I think for the most part the PC
Lemmings program will let you get away with doing a horrible job with
compression, as long as you generate a .DAT file that follows the expected
format.

============================================

Okay, I Decompressed; Now What?
-------------------------------

Once you decompress a .DAT file into its original uncompressed data, the
next step of course will depend on the purpose of the data.  The
uncompressed data format of a levelXXX.dat file will obviously be different
from that of a XgagrX.dat file.  The details on this really belong in
separate documents.  Moreover, right now I only know about the uncompressed
data formats for levelXXX.dat, vgagrX.dat, and vgaspecX.dat.  But I'll make
a brief explanation here.

The levelXXX.dat is fairly simple after you decompress it.  Each compressed
data section (the combination of a header and the compressed bytes)
corresponds to a single level.  After decompressing a single section, you
get exactly 2048 bytes of data in the .LVL format lemedit uses for its
so-called "direct save".  You can find the details of the .lvl file format
by Googling on the web.
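For instance, here is a rough sketch of walking the sections of a .DAT file
to pull out the individual levels.  decompress_section is a stand-in for
the decompression algorithm described in the first half of this document,
and note that the byte order of the header's words is an assumption here
(big-endian), which you should verify against real files before relying on
it:

    #include <cstdint>
    #include <vector>

    // Assumed stand-in for the decompression algorithm described above.
    std::vector<uint8_t> decompress_section(const uint8_t* sec, size_t size);

    // ASSUMPTION: 16-bit header words stored big-endian.
    static unsigned read_word(const uint8_t* p)
    {
        return (unsigned(p[0]) << 8) | p[1];
    }

    std::vector<std::vector<uint8_t>>
    split_dat(const std::vector<uint8_t>& file)
    {
        std::vector<std::vector<uint8_t>> sections;
        size_t pos = 0;
        while (pos + 10 <= file.size()) {
            // compressed_data_size (header offset 8) counts the header too
            size_t total = read_word(&file[pos + 8]);
            if (total < 10 || pos + total > file.size()) break;

            uint8_t checksum = 0;              // xor of the data bytes only,
            for (size_t i = pos + 10; i < pos + total; i++)
                checksum ^= file[i];           // the header is excluded
            if (checksum != file[pos + 1]) break;   // corrupt section

            sections.push_back(decompress_section(&file[pos], total));
            pos += total;
        }
        return sections;
    }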
The vgagrX.dat holds exactly 2 compressed data sections per file.  I trust
that you know or gathered that each vgagrX.dat file corresponds to a
"normal graphics set" of the same number, specifically the graphics used in
the EGA/VGA modes.  The first data section contains the bytes for the
bitmaps used for the terrain pieces, while the second data section contains
the bytes for the interactive objects' bitmaps.  In order to further
interpret these bytes, you will need to also parse the groundXo.dat files,
which hold the descriptions (eg. size, activation area, number of frames of
animation) of the objects and terrain of the graphics set, as well as an
offset specifying where in the uncompressed data of vgagrX.dat you will
find the bitmap data for the object/terrain.  I'll discuss this in a future
document (both the vgagrX.dat and the groundXo.dat).

Finally, the vgaspecX.dat files each hold exactly a single compressed data
section.  I trust that you know or gathered that each vgaspecX.dat file
corresponds to a so-called "extended graphics set", used by the 4 "special
levels" in PC Lemmings (the 'spec' in the filename stands for "special" by
the way, I think).  As mentioned earlier, the vgaspecX.dat files use two
levels of compression, so after decompressing the .dat the normal way, you
have to apply a second, different decompression algorithm.  Fortunately,
this second decompression algorithm (and the corresponding compression
algorithm) is much more straightforward.  After this second round of
decompression, you will end up with basically a 960x160 bitmap.  This is 3
screens' worth of graphics.

The bad news is, the so-called "extended graphics set" is not really a
graphics set at all.  Instead, it is basically just a bitmap representing
the entire terrain of a level!  So you can dash your hopes now of creating
a level using the graphics of "A Beast of a Level".  Well, you can, but
you'll be forced to use the entire terrain exactly as it is from the level,
not individual parts and elements of it.  I'll discuss vgaspecX.dat in
another future document.

The cgagrX.dat and cgaspecX.dat are obviously the CGA counterparts to
vgagrX.dat and vgaspecX.dat.  I didn't bother to figure out how to use
them, since all PCs nowadays capable of running Win95 or better are
required by Windows to support VGA or better.  EGA and VGA use basically
identical bitmaps except the palettes are different.  (The palettes are
stored in groundXo.dat by the way.)

As for main.dat and its CGA counterpart cgamain.dat, they obviously
correspond to the other game graphics, like the intro screen, the fonts,
the background texture used in the password and level preview screens, etc.
I have not tried to figure this out yet, and certainly for level-editing
purposes I'd never need to.