Practice Assignment

(encoder.c)

Information

  1. For this practice assignment, you will write two functions. The first function is called encode and its prototype is this:
    void encode(FILE *infile, FILE *outfile);
    
    The second function is called decode and this is the prototype:
    void decode(FILE *infile, FILE *outfile);
    
    Looking at the names of the functions, you can probably guess what they do. One function encodes some file's data, and the other function decodes the data. There are no return values from the functions. We will assume that if the files can be opened (by the driver), then you should be able to encode/decode them successfully.

    The encoding/decoding scheme that you are going to implement is a lossless data compression algorithm known as run-length encoding (RLE). It's simple enough that even beginning programmers can implement it. In a nutshell, a data compression algorithm (like those used to create .zip or .rar files) removes redundancy from a file, which generally makes the compressed file smaller (sometimes much smaller) than the original.

    As an example, suppose that you had a file (A) that contained 100 characters and they were all the letter 'A' (ASCII 65). That file would be 100 bytes in size and look like this:

    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    
    Looking at it with dumpit, it looks like this:
    A:
           00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
    --------------------------------------------------------------------------
    000000 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000010 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000020 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000030 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000040 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000050 41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41   AAAAAAAAAAAAAAAA
    000060 41 41 41 41                                        AAAA
    
    However, because there are so many consecutive letters that are the same, we can encode that information in fewer bytes. So, instead of writing 100 copies of 'A' to the file, we will write one letter 'A', and then follow it with a count. The count is the actual number of letter A's that were in the original file. So, our "encoded" file (A.rle) will have only two bytes in it (shown in hexadecimal):
    41 64
    
    Hex 41 is decimal 65 (the letter 'A') and hex 64 is decimal 100, the number of times to repeat the character. Looking at the binary file with dumpit:

    A.rle:
           00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
    --------------------------------------------------------------------------
    000000 41 64                                              Ad
    
    That's a HUGE difference! The original file required 100 bytes on the disk, but the compressed file only required 2 bytes! ("Welcome to the wonderful world of data compression!")
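
    To make the pair format concrete, here is a small hypothetical helper (write_run is not required by the assignment; it is only an illustration) showing how one run could be written with fputc. Notice that the count goes out as a single raw byte, not as the text "100":

    /* Hypothetical helper, for illustration only: write one run to the */
    /* output file as a (character, count) pair of raw bytes.           */
    void write_run(FILE *outfile, int ch, int count)
    {
        fputc(ch, outfile);    /* the repeated character, e.g. 0x41 for 'A' */
        fputc(count, outfile); /* the run length, e.g. 0x64 for 100         */
    }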

    Of course, at some point we will have to "decode" the file and expand it back to the original size and contents. That's what the decode function will do. It will take as input a compressed file containing just those 2 bytes, and create a new file that has 100 characters in it, all of them the letter 'A' (one possible approach to decode is sketched after the ABCDE example below):

    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    
    Pretty cool! Of course, if the file doesn't contain a lot of redundant data, the RLE algorithm will actually make the file larger, up to twice the original size. This can happen with other compression algorithms as well, although usually to a lesser degree. As an example, suppose our file (ABCDE.txt) contains just the letters ABCDE (5 bytes):

    ABCDE.txt:
           00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
    --------------------------------------------------------------------------
    000000 41 42 43 44 45                                     ABCDE
    
    Encoding this file to ABCDE.rle causes it to be twice as large (10 bytes):
    ABCDE.rle:
           00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
    --------------------------------------------------------------------------
    000000 41 01 42 01 43 01 44 01  45 01                     A.B.C.D.E.
    
    So, the more redundancy in the file, the better the compression. Again, this is true for all compression algorithms.
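
    Because decode only has to expand (character, count) pairs back into runs, it is the simpler of the two functions. The following is just a rough sketch of one possible approach; it assumes a well-formed encoded file and relies on the driver having opened both files in binary mode (see the notes below):

    void decode(FILE *infile, FILE *outfile)
    {
        int ch;    /* the repeated character (int so EOF can be detected) */
        int count; /* how many copies of it to write                      */

        /* Read (character, count) pairs until the end of the input file. */
        while ((ch = fgetc(infile)) != EOF)
        {
            count = fgetc(infile);
            while (count-- > 0)
                fputc(ch, outfile);
        }
    }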

    Here is a driver file: HTML   Text

    The name of your implementation file should be encoder.c and the command to compile it will look like this:

    gcc -O -Werror -Wall -Wextra -ansi -pedantic main.c encoder.c -o encoder
    

    To run the program and encode a file named A300.txt into a file named A300.rle:

    encoder A300.txt A300.rle
    
    To decode the encoded file:
    encoder -d A300.rle myA300.txt
    
    Simple test files to get you started: files.zip    There are two sets of files: files starting with in are input files, and files starting with out are the corresponding encoded output files. Start with the smaller ones first.

    After you've successfully tested your code with the simple files above, try to encode these bigger files:

    Input file       Output file (encoded)   Hex output (dumpit)
    -------------    ---------------------   -------------------
    A300.txt         A300.rle                A300-hex.txt
    icon1.bmp        icon1.rle               icon1-hex.txt
    smiley-bw.bmp    smiley-bw.rle           smiley-bw-hex.txt
    black.bmp        black.rle               black-hex.txt
    dude.jpg         dude.rle                dude-hex.txt

    All of the big files in one zip: big-files.zip

    Approximate number of lines of code for encode: 30
    Approximate number of lines of code for decode: 8

Notes

  1. Both functions take two FILE pointers. The driver has already opened both of them and will close them. DO NOT close these files yourself. The first FILE pointer is opened for read/binary and the second FILE pointer is opened for write/binary.
  2. The only header file that you need to include is stdio.h.

  3. Since the largest value that will fit into a single byte (an unsigned char) is 255 (0xFF), that is the longest run that can be encoded in a single chunk. So, if the letter 'A' appears 300 times consecutively, you will have to store 2 chunks: the first chunk encodes the first 255 bytes, and the second chunk encodes the remaining 45 bytes (see note 7 below for a sketch of this chunking):
    A300.rle:
           00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
    --------------------------------------------------------------------------
    000000 41 FF 41 2D                                        A.A-
    
  4. You should start with the very simple files before attempting to encode the very large ones. There is a .zip file that contains several simple test cases (in files) along with their corresponding encodings (out files).
  5. Because the encoded files (.rle) are binary files (not text files), you won't be able to use diff to compare your output with the expected output. You can use a program called cmp (from the Cygwin toolkit), which is meant for binary files. If the files are identical, cmp won't say anything. However, if they are different, you will see output that resembles this:
    file1.rle file2.rle differ: byte 492, line 2
    
    which isn't very helpful. See the next note.
  6. You should use the dumpit program (or similar) to see what your binary encoded files look like; this will help you debug your implementation. You can't use a text editor to view the encoded files because they are not text files. If you run dumpit on both encoded files (your file and the expected output file) and then run diff (or a visual diff tool like WinMerge or kdiff3) on the resulting text files, you'll get a better view of what is wrong. So, using the example from above, run these commands:
    dumpit file1.rle > file1-hex.txt
    dumpit file2.rle > file2-hex.txt
    diff file1-hex.txt file2-hex.txt
    
    and you will be able to see which bytes are different using diff. Hint: use a visual diff tool.
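
  7. Putting the pieces together, here is a rough sketch of one way that encode could count runs and respect the 255-byte chunk limit from note 3. It is only a sketch (the variable names are illustrative, and other structures work just as well), and it relies on the driver having opened both files in binary mode:

    void encode(FILE *infile, FILE *outfile)
    {
        int current; /* character that starts the current run          */
        int next;    /* character just read from the input file        */
        int count;   /* length of the current run (kept between 1-255) */

        current = fgetc(infile);
        if (current == EOF)
            return; /* empty input file: nothing to encode */

        count = 1;
        while ((next = fgetc(infile)) != EOF)
        {
            if (next == current && count < 255)
                count++;                 /* extend the current run       */
            else
            {
                fputc(current, outfile); /* emit the finished chunk as a */
                fputc(count, outfile);   /* (character, count) pair      */
                current = next;          /* start a new run              */
                count = 1;
            }
        }

        /* Emit the run that was in progress when end of file was reached. */
        fputc(current, outfile);
        fputc(count, outfile);
    }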