File management – Cornell Data Services

File organization and naming conventions are often unique to the lab and can be highly personalized. The important thing is to be consistent and to write the conventions down.

Discuss and decide file management strategies early in the project planning process (e.g. lab groups or project collaborators should agree on how to name and organize files).
Write the strategy down and save it to a shared space so that everyone follows the same conventions.
Revise the strategy if needed.

Best practices

Directory structure naming conventions

Do not rely on directory structure to provide critical information about the file contents.
Top-level folders should include the project title, a unique identifier, and the date (year), but the files themselves should be well described independent of the directory structure.
Consider creating a brief description of the contents of major folders and providing an overview of the directory structure. This can be a text document or readme file that is stored in a top-level folder or shared space. The level of detail to strive for is enough to help someone else understand the contents and organization of your files in your absence.

Example:

Top folder: cornell_study
- Subfolder 1: cornell_study_datasets
  - cornell_study_2019-2020.csv
  - cornell_study_2021-2022.csv
- Subfolder 2: cornell_study_semantic_analysis
  - cornell_study_semantic_analysis.R
  - cornell_study_semantic_analysis_output.csv
- Readme file: cornell_study_readme.txt

Love your current file directory structure scheme? Contact us about it! We always appreciate seeing examples that work well for researchers.

(Adapted from the University of Illinois at Urbana-Champaign’s February 2022 Data Nudge.)
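As a minimal sketch of the example layout above, the following Python creates the folders and writes a readme that lists each one. The function name `create_project` and the folder descriptions in `LAYOUT` are assumptions made for illustration; they are not part of the original guidance.

```python
from pathlib import Path

# Hypothetical folder descriptions mirroring the example layout above.
LAYOUT = {
    "cornell_study_datasets": "Data files, one CSV per collection period.",
    "cornell_study_semantic_analysis": "R script and its output file.",
}


def create_project(root: Path, project: str, layout: dict) -> Path:
    """Create the top folder and its subfolders, then write a readme listing them."""
    top = root / project
    top.mkdir(parents=True, exist_ok=True)
    lines = [f"Project: {project}", "", "Folders:"]
    for folder, description in layout.items():
        (top / folder).mkdir(exist_ok=True)
        lines.append(f"- {folder}: {description}")
    readme = top / f"{project}_readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling `create_project(Path("."), "cornell_study", LAYOUT)` would create the folders in the current directory and return the path to `cornell_study_readme.txt`. Keeping the descriptions next to the code that creates the folders is one way to keep the readme from drifting out of date.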

File naming conventions

Make file names unique, including the most important identifying information of the project. File names should be short, so do not try to include all of these, but elements of a good file name may include:
- project name, acronym, or research data name
- study title
- location information
- researcher initials
- date (consistently formatted, e.g. YYYYMMDD)
- version number
Use underscores to separate elements; avoid special characters, spaces, and periods other than the one before the file extension. Dashes are also acceptable, especially when working with HTML files.
Use leading zeros when incorporating numbers to enable sorting (a sequence of 1-100 should be numbered 001-100).
File names should be short enough to be readable, while still conveying enough pertinent information. While some operating systems can handle file paths (file name AND directory route) more than 255 characters in length, many tools and operating systems impose shorter limits. For this reason, it is generally advisable to keep file names as short as possible, ideally no more than 32 characters. For example, the name before the extension in ThisIsJustWhatThirtyTwoLooksLike.txt is exactly 32 characters long.
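To illustrate why leading zeros matter, here is a small sketch comparing how unpadded and zero-padded names sort. The helper `numbered_name` and the `sample` prefix are hypothetical, chosen only for this example.

```python
def numbered_name(prefix: str, index: int, total: int, ext: str) -> str:
    """Zero-pad index to the width of the largest number in the sequence."""
    width = len(str(total))
    return f"{prefix}_{index:0{width}d}.{ext}"


# Plain alphabetical sorting, as most file browsers and tools do by default.
unpadded = sorted(f"sample_{i}.csv" for i in (1, 2, 10, 100))
padded = sorted(numbered_name("sample", i, 100, "csv") for i in (1, 2, 10, 100))

print(unpadded)
print(padded)
```

Without padding, `sample_2.csv` sorts after `sample_100.csv`; with padding, the names sort in numeric order (`sample_001.csv`, `sample_002.csv`, `sample_010.csv`, `sample_100.csv`).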

Good filename example: DV_ICPOES_20101115_JDSv2.dat

Bad filename example: my Data @DryValley November 15 2010.v2.dat

In the good example, DV is the site code (Dry Valley), ICPOES is the instrument from which the data originated, 20101115 is the date of the sample run on the instrument, JDS are the initials of Jane Doe Scientist, and v2 indicates that this is the second version of the data file.
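The naming rules above can be sketched in code. The following is one possible approach, not a standard: `build_name` assembles a name in the same pattern as the good example, and `is_safe_name` checks for spaces, special characters, and extra periods. Both function names and the regular expression are assumptions made for illustration.

```python
import re
from datetime import date

# Letters, digits, underscores, and dashes, then one period and an extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_safe_name(filename: str) -> bool:
    """Return True if the name has no spaces, special characters, or extra periods."""
    return SAFE_NAME.fullmatch(filename) is not None


def build_name(site: str, instrument: str, run_date: date,
               initials: str, version: int, ext: str) -> str:
    """Join the elements with underscores, formatting the date as YYYYMMDD."""
    stem = "_".join([site, instrument, run_date.strftime("%Y%m%d"),
                     f"{initials}v{version}"])
    return f"{stem}.{ext}"
```

With these definitions, `build_name("DV", "ICPOES", date(2010, 11, 15), "JDS", 2, "dat")` returns `DV_ICPOES_20101115_JDSv2.dat`, which passes `is_safe_name`, while the bad example above fails it because of its spaces, `@` sign, and extra period.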

Keep track of versions (version control)

It is important to keep track of versions when working with data. There are many benefits, most importantly the ability to revert data to an earlier version instead of starting from scratch or, worse, having to regenerate the data. There are three basic ways to keep track of versions:

Rely on tools that assign version numbers automatically, such as Electronic Lab Notebooks or Box.
Use a naming scheme (e.g. v001, v002).
Use version control software.

Some best practices for working with versions include:

Save an untouched copy of the raw, original data, and leave it that way. Always work on a copy of the “safe” untouched data (make a new copy in order to start from scratch if necessary).
Avoid ambiguous labels, such as ‘revision’, ‘final’, ‘final2’, etc. Instead, use a file naming convention (like v001, v002 or v1_0, v1_2, v2_0).
Use a directory structure naming convention that includes version information.
Use tools that automatically assign version numbers to manage the data. It is a good idea to test this method to make sure that it is possible to revert to earlier versions and that the tool functions as expected.
If appropriate, use version control software (such as SVN or Git). A coding project may be a good candidate for using version control.
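For the naming-scheme approach, a small helper can assign the next version number so that nobody has to pick it by hand. This is a sketch under assumptions: it expects names that end in `_v` followed by digits (such as `_v001`), and the function names `next_version` and `save_new_version` are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Matches a trailing version tag such as "_v001" at the end of the file stem.
VERSION = re.compile(r"_v(\d+)$")


def next_version(path: Path) -> Path:
    """Return the next version's path, e.g. results_v001.csv -> results_v002.csv."""
    match = VERSION.search(path.stem)
    if match is None:
        # No version tag yet: the first working copy becomes v001.
        return path.with_name(f"{path.stem}_v001{path.suffix}")
    width = len(match.group(1))
    number = int(match.group(1)) + 1
    base = path.stem[: match.start()]
    return path.with_name(f"{base}_v{number:0{width}d}{path.suffix}")


def save_new_version(current: Path) -> Path:
    """Copy current to the next version name, refusing to overwrite a file."""
    target = next_version(current)
    if target.exists():
        raise FileExistsError(target)
    shutil.copy2(current, target)
    return target
```

Calling `save_new_version` on an untouched raw file such as `raw.csv` produces `raw_v001.csv` and leaves the original in place, which matches the advice to always work on a copy.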