Corruption of Graphics Files

What is a corrupt graphics file?

Imagine every munged image you've ever seen displayed by a graphical display program--shifted or broken pictures, large areas of snowy or mosaic patterns, colors that otherwise appear only in Andy Warhol paintings, or simply no image displayed at all. Why does this happen?

Causes of Corruption

When an image fails to display properly, the cause(s) might be any or all of the following:

Problems with the display environment

Most graphical display problems can be corrected by adjusting one or more aspects of the display environment. For example, if you attempt to display a truecolor image using a video graphics card or software driver that does not support the full bitdepth or resolution of the image, the display program will either reduce the number of colors in the displayed image, or simply refuse to display the image at all. In either case, the results will probably not look as you expected.

Installing the proper software driver for the graphics card, resolution, and number of colors desired may fix this display problem. Upgrading your display program to a newer version or using a different program is also an option. And, of course, who couldn't use a faster graphics card with more memory, and a larger, higher-resolution display monitor as well?

Sometimes the fault lies with the file reader. A display program should make every attempt to verify that it understands the format of the data it is reading. For example, reading a JPEG file as a GIF file may cause a display program to produce unexpected results.

Many file formats have different internal variations depending upon the revision level of the format and the type of data that the files store. Most formats (e.g., TIFF and TGA) make it easy to determine the type of data stored in the file by looking at header information, and others (e.g., PCX) make it somewhat more difficult. Some formats (e.g., Amiga IFF) use a different file extension for every type of data that they store, and others (e.g., XBM) store only one type of data.

You're asking for trouble if you assume from a file's extension that the file has a specific format. More than one file has been given an improper file extension. And if the display program is accepting input directly from a data stream, there won't be a file extension to read in any case.
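One way to avoid trusting the extension is to look at the first few bytes of the data itself. The following is a minimal sketch in Python, not a complete identifier: the function name sniff_format and the table layout are illustrative assumptions, and only a handful of formats with well-known leading signatures are covered. Formats without a leading signature would need other checks.

```python
# Leading "magic" bytes for a few common formats.
# Longer signatures are listed first so they are tested first.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"BM", "BMP"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format name from the leading bytes, or return 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Because the function takes bytes rather than a filename, it works equally well on the first block read from a data stream, where no extension exists.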

With most formats, the file extension doesn't change to reflect the revision level of the file format, or the type of data stored in the file. This lack of human-readable recognition has probably hurt the TIFF file format the most. A TIFF file can store any type of image data ranging from monochrome to truecolor, and can compress it using any one of a half-dozen or more methods of encoding.

Many TIFF viewers support RLE-compressed or uncompressed monochrome images, but not images compressed using CCITT G3 and G4 encoding. Other TIFF viewers support gray-scale and palette color images but not truecolor images. Some viewers support the display of images compressed using the JPEG or TIFF-LZW algorithms, and (most) others do not.

As you can see, many different types of images can be stored in a TIFF file--and all of them will have the ".TIF" extension. It's no wonder that users become frustrated when some TIFF files display and others don't; after all, in a directory listing all the files "look" the same. Make sure that not all files look the same to your file reader.
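A reader can tell TIFF variants apart by inspecting the tags in the file rather than its name. The sketch below, written in Python under simplifying assumptions, looks only at the Compression tag (tag 259) in the first image file directory (IFD). The function name tiff_compression is hypothetical, the name table lists only some common compression values, and a real reader would also examine tags such as BitsPerSample and PhotometricInterpretation.

```python
import struct

# A few common values of the TIFF Compression tag (259).
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    32773: "PackBits RLE",
}


def tiff_compression(data: bytes) -> str:
    """Report the compression scheme named in the first IFD of a TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:4] == b"II*\x00":
        order = "<"          # little-endian ("Intel") byte order
    elif data[:4] == b"MM\x00*":
        order = ">"          # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")

    (ifd_offset,) = struct.unpack_from(order + "I", data, 4)
    if ifd_offset + 2 > len(data):
        raise ValueError("IFD offset points outside the file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)

    for i in range(count):
        entry = ifd_offset + 2 + 12 * i      # each IFD entry is 12 bytes
        if entry + 12 > len(data):
            raise ValueError("IFD entry runs past the end of the file")
        (tag,) = struct.unpack_from(order + "H", data, entry)
        if tag == 259:
            # A single SHORT is stored at the start of the 4-byte value field.
            (value,) = struct.unpack_from(order + "H", data, entry + 8)
            return COMPRESSION_NAMES.get(value, f"unknown ({value})")
    return "uncompressed"    # the tag defaults to 1 when it is absent
```

A viewer that cannot decode the reported scheme can then say so explicitly instead of displaying garbage.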

Problems with the program code

We've discussed the fact that an image might fail to display properly because the program reading it lacks the features to display the image data. It is also quite possible that an image may fail to display because of bugs in the display program's code. Such bugs usually arise when the programmer misinterprets the file format specification, doesn't fully understand the programming language being used, or incorporates another programmer's buggy code. Testing the display program on a wide variety of image files will reveal such problems.

Problems with the data

What about the graphics file? Can the file itself be the source of the problem?

A graphics file is no different from any other type of data file on your system. Like other files, the data a graphics file contains may be incorrectly constructed (bad data) or damaged (corrupt data). Bad or corrupt files occur as a consequence of one or more of the following problems:

Bad data may result from a poorly designed file format writer. Improperly calculating header values (e.g., number of colors, resolution, or file size) or writing the data in an incorrect byte order will mislead a format reader into misinterpreting the image data. A buggy codec (encoder/decoder) may appear to be doing its job while actually producing encoded data that violates the compression algorithm. (For example, the TIFF specification implements the LZW algorithm in a faulty way.)

Files may become corrupted in a variety of ways. For example:

Detecting File Corruption

File format readers must be able to quickly detect that a file's data is incorrect or unexpected, or that it in some way violates the specification of the file format or data compression algorithm. Quick detection allows the reader to respond with an error message to the user, and keeps the program from crashing when it reads bad data. The reader should also analyze the problem accurately and display a descriptive error message to the user.

How can you tell whether a file is corrupt? We describe several indicators below.

EOF marker

The end of file (EOF) file stream marker is a good indicator. Often, graphics files are truncated through errors in transmission or by a failed write operation to a disk. In such cases, when the file is read, the EOF will occur much sooner than a file format reader would have expected, and corruption of the file may be assumed. Read operations will also fail if there is an actual error in the filesystem or disk. Always check the return value of your read operations. An unexpected EOF, or any file stream error, is a sure sign that something is wrong.
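Checking every read is easier when the check lives in one helper. The following Python sketch assumes a blocking binary stream; the names read_exact and TruncatedFileError are illustrative, not part of any library.

```python
import io


class TruncatedFileError(Exception):
    """Raised when EOF arrives before the expected amount of data."""


def read_exact(stream, size: int) -> bytes:
    """Read exactly `size` bytes from a binary stream, or raise."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:                      # an empty result means EOF
            raise TruncatedFileError(
                f"expected {size} bytes, got only {size - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage: an in-memory stream stands in for a file opened in binary mode.
header = read_exact(io.BytesIO(b"GIF89a" + b"\x00" * 7), 13)
```

A reader built on such a helper cannot silently continue with a short buffer; an early EOF always surfaces as an error the caller must handle.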

Unexpected characters

Missing or excessive data may cause an improper alignment of the internal structures of a file format. Data structures in memory often contain invisible 2- or 4-byte boundary padding between structure elements that may unintentionally be written to a file. Data written to a file opened in text mode, rather than in binary mode, may contain embedded carriage return and/or linefeed characters and may therefore create bad data.
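Both problems can be demonstrated in a few lines of Python. This is a sketch using a hypothetical two-field header (a 1-byte version followed by a 4-byte width); the helper name looks_newline_translated is also an assumption made for illustration.

```python
import struct

# Native layout ("@") follows the platform's alignment rules and may insert
# padding after the 1-byte field; an explicit byte order ("<") never pads.
native_size = struct.calcsize("@BI")
packed_size = struct.calcsize("<BI")      # 1 + 4 bytes, no padding

# Writers should serialize field by field with an explicit byte order.
header = struct.pack("<BI", 2, 640)


def looks_newline_translated(expected: bytes, actual: bytes) -> bool:
    """True if `actual` is `expected` with every LF expanded to CR LF."""
    return actual != expected and actual == expected.replace(b"\n", b"\r\n")
```

On many platforms native_size is larger than packed_size, which is exactly the invisible padding that ends up in a file when an in-memory structure is written out verbatim.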

Magic value errors

Stream-oriented formats divide stored data into individual sections called segments (blocks, chunks, etc.), each of which begins with a specific identification or "magic" value followed by the length of the data in the segment. If a format reader reads in an entire segment and discovers that the next data in the file is not the expected magic value of the following segment (or the end of data stream marker), then the reader assumes that the data is bad or corrupt.
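The check described above can be sketched as follows in Python. The segment layout (a 4-byte identifier followed by a 4-byte big-endian length) and the identifiers HEAD, DATA, and END are hypothetical, chosen only to illustrate the technique; real formats define their own.

```python
import struct

KNOWN_TAGS = {b"HEAD", b"DATA", b"END "}   # hypothetical segment identifiers


def walk_segments(data: bytes):
    """Return [(tag, payload), ...]; raise ValueError at the first bad segment."""
    segments = []
    pos = 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError(f"truncated segment header at offset {pos}")
        tag, length = struct.unpack_from(">4sI", data, pos)
        if tag not in KNOWN_TAGS:
            # The previous segment's length was wrong, or the data is damaged.
            raise ValueError(f"bad magic value {tag!r} at offset {pos}")
        start = pos + 8
        if start + length > len(data):
            raise ValueError(f"segment {tag!r} runs past the end of the data")
        segments.append((tag, data[start:start + length]))
        pos = start + length
        if tag == b"END ":
            break
    return segments
```

Reporting the offset of the failure, as the messages here do, makes it far easier to diagnose whether a length field or the data itself is at fault.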

Out-of-range offset values

File-oriented formats typically use fixed-size data structures and absolute offset values to locate data. An offset value that points outside the file space is a sure indication that the offset value is wrong, or the file has been truncated.
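A reader can validate every offset before seeking to it. The following is a minimal Python sketch; the function name check_offset and its parameters are assumptions made for illustration.

```python
def check_offset(offset: int, length: int, file_size: int) -> None:
    """Raise ValueError unless bytes [offset, offset + length) lie in the file."""
    if offset < 0 or length < 0:
        raise ValueError("negative offset or length")
    if offset > file_size or length > file_size - offset:
        raise ValueError(
            f"range {offset}..{offset + length} lies outside "
            f"a {file_size}-byte file")
```

Checking the length along with the offset catches the case where the offset itself is valid but the structure it points to extends past the end of a truncated file.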

Hints for Designing File Readers and Writers

What should a file format reader do in the case of missing or excessive data? It depends on the file format and the data itself. If the bad information is trivial (e.g., a text comment or a thumbnail image), the reader may choose to ignore the bad data and continue reading the file. If the information is critical (e.g., the header), the reader should simply give up.

Regardless of the action it takes, a file reader should display a warning, error, or diagnostic message to indicate that something unexpected has occurred. Messages such as "Unknown file format", "Unknown compression type", "Unsupported resolution", or "Corrupt data" will at least give the user a clue as to what is wrong.

Here are some tips for designing an error-detecting file format reader:

Here are some tips for designing a file format writer:

Most file formats do not have built-in mechanisms for error detection. Instead, we rely on the file reader to recognize bad or corrupt data, based on information stored in the file, and to react accordingly. Some formats store the size of the graphics data, or even the length of the entire file, in their headers. Other formats contain fixed-sized data structures that change only between revisions of the format. These features are not specifically designed to detect or correct file or data errors, but they can be used in that way and are better than nothing.
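As one illustration, the Windows BMP file header records the total file size in a 4-byte little-endian field immediately after the "BM" signature. The Python sketch below compares that field with the actual length of the data. The function name bmp_size_matches is hypothetical, and because some writers are known to fill this field incorrectly, a mismatch is best treated as a warning sign rather than proof of corruption.

```python
import struct


def bmp_size_matches(data: bytes) -> bool:
    """Compare the file size recorded in a BMP header with the actual length."""
    if len(data) < 6 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    (recorded,) = struct.unpack_from("<I", data, 2)
    return recorded == len(data)
```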

At least one format, PNG, does include an active error-checking mechanism. PNG is a data stream format that comprises an 8-byte signature followed by three or more chunks of data. Each chunk ends with a 4-byte CRC-32 value calculated from the chunk's type and data fields. A PNG reader can calculate the CRC of these fields and then compare this value to the one stored in the chunk. If the values do not match, the reader can assume that the data in that chunk is corrupt.
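The comparison can be sketched in Python with the standard zlib module, whose crc32 function computes the same CRC-32 that PNG uses. In a PNG chunk the CRC covers the 4-byte chunk type and the chunk data, but not the length field. The function name check_png_chunks is an assumption made for illustration, and the sketch only verifies CRCs; it does not decode any image data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png_chunks(data: bytes):
    """Return [(chunk_type, crc_ok), ...] for every chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    results = []
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        # Each chunk: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack_from(">I4s", data, pos)
        end = pos + 8 + length + 4
        if end > len(data):
            raise ValueError(f"chunk {ctype!r} runs past the end of the data")
        covered = data[pos + 4:pos + 8 + length]       # type + data fields
        (stored,) = struct.unpack_from(">I", data, pos + 8 + length)
        results.append((ctype, zlib.crc32(covered) == stored))
        pos = end
    return results
```

Returning a per-chunk result, rather than failing at the first mismatch, lets a reader decide whether the damaged chunk is critical or can safely be skipped.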

The PNG signature is also unique in that it contains several characters used to detect whether the file was improperly processed by a 7-bit data channel or a text processing filter. (See the PNG article for more information.)
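The eight signature bytes are 0x89, "PNG", CR, LF, 0x1A, LF. The first byte has its high bit set, so a 7-bit channel changes it, and the CR and LF bytes are altered by filters that translate line endings. The Python sketch below shows how a reader might turn a damaged signature into a specific diagnosis. The function name and the wording of the messages are illustrative assumptions, and only a few kinds of damage are recognized.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def diagnose_png_signature(head: bytes) -> str:
    """Describe what, if anything, happened to a PNG signature."""
    if head.startswith(PNG_SIGNATURE):
        return "ok"
    if head.startswith(b"\x09PNG"):
        # 0x89 with its top bit cleared becomes 0x09.
        return "high bit stripped (7-bit channel)"
    if head.startswith(b"\x89PNG\n"):
        return "CR LF converted to LF"
    if (head.startswith(b"\x89PNG\r\r\n")
            or head.startswith(b"\x89PNG\r\n\x1a\r\n")):
        return "LF converted to CR LF"
    return "not a PNG signature"
```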

For formats that don't provide any real error checking, you might consider storing files using a file archiving program that offers error checking as a feature. Many archivers, such as pkzip and zoo, perform a CRC calculation on each file that they store. When the file is extracted from the archive, the value is recalculated and compared to the stored value. If they match, you can be reasonably confident that the file has not been corrupted. Archiving your graphics files is especially recommended if you are sending them over a data communications network, such as the Internet.
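The same check can be performed programmatically. The Python sketch below uses the standard zipfile module, whose ZipFile.testzip method reads every member, checks its CRC, and returns the name of the first bad member (or None if all are intact). The helper names archive_files and first_corrupt_member are assumptions made for illustration, and the archive is kept in memory only to keep the example self-contained.

```python
import io
import zipfile


def archive_files(files: dict) -> bytes:
    """Store {name: bytes} in an uncompressed ZIP archive held in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def first_corrupt_member(archive_bytes: bytes):
    """Return the name of the first member failing its CRC check, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.testzip()
```

A receiving program can run such a check before extracting anything, and request retransmission if any member fails.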

Another type of external error-detecting mechanism is a digital signature. This is a method of detecting whether changes have occurred within a block of information. We discuss digital signatures in the context of graphics file encryption in the section that follows.



Copyright © 1996, 1994 O'Reilly & Associates, Inc. All Rights Reserved.