RESEARCH DATA MANAGEMENT SERVICE GROUP
Comprehensive Data Management Planning & Services

File formats

"Working" file formats, those used in the course of collecting and working with project data, are not always ideal for re-use or long-term preservation, and may not meet the requirements of data archives or repositories or satisfy the expectations of research funders.

In the absence of specific directives from funders or repositories, we offer the following general guidelines for selecting file formats for preservation and reuse. eCommons@Cornell provides more detailed information in its support document, Recommended File Formats for eCommons.

Principles for selecting file formats

Select open, non-proprietary formats

Open, non-proprietary formats are far more likely to remain usable even if the software that created them is no longer available or functional. Formats whose documentation is complete and freely available also have a higher likelihood of long-term preservation. If the program that created a file is the only option for reading or accessing its data, the file is likely in a proprietary, non-open format. As a general rule, plain text formats, such as comma- or tab-delimited files, are open formats and are typically better for re-use and long-term preservation.

  • Example of a proprietary format: Photoshop .psd file
  • Example of an open format: .tiff image file
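As an illustration of the plain text guideline, here is a minimal sketch of saving tabular data as comma- or tab-delimited text using only Python's standard csv module. The function names and the sample column names are hypothetical, chosen for this example; they do not come from any repository's requirements.

```python
import csv
import io


def rows_to_delimited(header, rows, delimiter=","):
    """Serialize tabular data as delimited plain text and return it as a string."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def delimited_to_rows(text, delimiter=","):
    """Parse delimited plain text back into a list of rows (lists of strings)."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [row for row in reader]


# Hypothetical sample data; values are kept as strings so they read back unchanged.
header = ["site", "temp_c"]
rows = [["A", "12.5"], ["B, north", "13.0"]]

comma_text = rows_to_delimited(header, rows)                 # comma-delimited
tab_text = rows_to_delimited(header, rows, delimiter="\t")   # tab-delimited
```

Because the result is plain text, it can be opened in any text editor or read by any CSV parser, without the software that produced it. Fields that contain the delimiter are quoted automatically by the csv module.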

Select "lossless" formats

Formats that compress the information in a file often produce smaller files, but the compression may permanently remove data. Such formats are "lossy," while formats from which the original information can be fully restored when uncompressed are "lossless."

  • Examples of lossy formats: .mp3 audio file, .jpeg image file
  • Examples of lossless formats: .wav audio file, .tiff image file
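The difference can be sketched in a few lines of Python. The lossless half uses the standard zlib module; the lossy half is a toy quantization scheme invented for this illustration, not the algorithm used by .mp3 or .jpeg, and the function names are hypothetical.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and then decompress with zlib; the original bytes come back."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(samples, step=10):
    """Toy lossy scheme: store each sample as the nearest multiple of `step`.

    The fine detail discarded by rounding cannot be recovered afterwards.
    """
    return [step * round(s / step) for s in samples]


original_bytes = b"temperature,12.5\n" * 100
restored_bytes = lossless_round_trip(original_bytes)

original_samples = [101, 98, 103, 97]
restored_samples = lossy_round_trip(original_samples)
```

Here restored_bytes is identical to original_bytes, while restored_samples differs from original_samples: every sample has collapsed to 100, and the original values cannot be reconstructed from the stored ones.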

Select unencrypted and uncompiled formats

If the encryption key, passphrase, or password to a file is lost, there may be no way to retrieve the data from the file later, rendering it unusable to others. Uncompiled source code is more readily re-usable by others and has a far greater likelihood of remaining usable over time, since it can be recompiled on different architectures and platforms.