Glossary – Cornell Data Services

This glossary provides definitions for select terms on this website that might be unfamiliar. This is not meant to be a comprehensive list.

Term	Definition
anonymized data	Data about individuals that neither reveals their identities nor links to other data that would reveal their identities. This term is not the same as “coded data” or “de-identified data.” Learn more about the differences between anonymized data, coded data and de-identified data (Office of Human Research Ethics at The University of North Carolina at Chapel Hill, 2020).
archive (verb)	To transfer material to a facility that appraises, preserves, and provides access to that material on a long-term or permanent basis. See “repository” to learn more about archiving facilities.
attribution	The act of crediting the original creator of a dataset that was released under an open copyright license. Note the difference between citation and attribution as described by Aesoph in the BCcampus Open Education Self-Publishing Guide.
code	Computer code or scripts. In the context of data management, this may include code used in the collection, manipulation, processing, analysis or visualization of data, but may also include software developed for other purposes.
coded data	Data accompanied by a linkage file that connects unique identifiers about an individual (such as their name, medical number, email address or telephone number) with a unique study ID code not associated with the individual’s personal information. The study ID may be a unique string of numbers and/or letters, such as ST01, ST02, ST03, and so on. This term is not the same as “anonymized data” or “de-identified data.” Learn more about the differences between anonymized data, coded data and de-identified data (Office of Human Research Ethics at The University of North Carolina at Chapel Hill, 2020).
confidentiality	The right of privacy and of non-release of disclosed personal information. Applies to data collected on human subjects. Researchers may be subject to legal requirements to prevent the release of private, personally identifiable information provided by research subjects.
copyright	A set of legal rights extended to copyright owners (the author or creator, or other party to whom the rights have been assigned) that govern such activities as reproducing, distributing, adapting, or exhibiting original works fixed in tangible form. Copyright does not apply to factual information; as a result, it does not apply to data. For more on copyright and data, see our Introduction to Intellectual Property Rights in Data Management.
data	Data subject to data management planning requirements may be defined differently by different funders, programs, or research communities. Data can be defined as: “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings” (OMB Circular A-110 from 1999). In developing a data management plan, researchers should consider which data would be required to verify their results and which data would have the highest potential and value for reuse by others.
de-identified data	Data for which all direct or indirect identifiers or codes linking the data to individual subjects’ identities are destroyed. Researchers can de-identify data by removing all identifiers or codes from the dataset or destroying the linkage file so that no data can be traced back to an individual. This term is not the same as “anonymized data” or “coded data.” Learn more about the differences between anonymized data, coded data and de-identified data (Office of Human Research Ethics at The University of North Carolina at Chapel Hill, 2020).
derivative (of data)	Any data, publication, illustration or visualization, or other work that rearranges, presents, or otherwise makes use of an existing data set.
FAIR Principles	Acronym for four key qualities of managing digital assets: ensuring that they are “Findable, Accessible, Interoperable, and Reusable.” Originally published in “FAIR Guiding Principles for scientific data management and stewardship” in Scientific Data (2016). Learn more about preparing FAIR data for reuse and reproducibility.
file format	The particular structure used for encoding data in a computer file. File formats are usually identified by the file extension (e.g. .xlsx, .csv, .dbf). File formats may be proprietary or open (with a readily available specification or description of the format). Open file formats usually maximize the potential for reuse and longevity. For more information about preservation and file formats, see our best practices guidance on file formats, or more in-depth guidance from Cornell’s Institutional Repository: Recommended File Formats for eCommons.
institutional repository	A service storing and providing online access to digital content. Content is typically produced by the institution that hosts the service. To learn more about Cornell’s institutional repository see our guide to eCommons.
intellectual property	Rights applied to creative works, including (but not limited to) copyright, patents, trademarks, and trade secrets. For more information, see our Introduction to Intellectual Property Rights in Data Management.
license	In the context of data management, a legal instrument that expresses the terms of use of a data set. For more information, see our Introduction to Intellectual Property Rights in Data Management.
machine-readable data	Structured data that can be easily processed by a computer. Making data machine-readable often requires cleaning and preprocessing raw research data.
metadata	Documentation or information about a data set. It may be embedded in the data itself, or exist separately from the data. Metadata may describe the ownership, purpose, methods, organization, and conditions for use of data, technical information about the data, and other information. Many metadata standards exist across a broad range of disciplines and applications. For more information, see our best practices guidance: Metadata and Describing Data.
open access	Typically used to describe publications, open access refers to online, freely available material that has few or no copyright or licensing restrictions. (Suber, 2004)
open data	Data that can be freely accessed, used, modified and shared. Entails legal and technical permissions for users to access and modify the data. Can be subject to use or attribution requirements. (Open Data Handbook)
open source software	Software for which the source is available under an open license. Often available to users at no cost. Allows users to inspect the source code, run their own versions of the program, offer modification suggestions, fix bugs, develop new features, etc. Usually managed by one or more volunteers. (Open Data Handbook)
persistent identifier	A unique and long-lasting reference that allows for continued access to a digital object. Examples of persistent identifier systems include Digital Object Identifiers (DOIs), handles, and Archival Resource Keys (ARKs). Persistent identifiers support interoperability and the reliable citation of digital content.
preservation (of data)	Ensuring that data remain intact, accessible and understandable over time. This requires preserving the integrity of the digital files themselves, but can be considerably more complicated than that. Preservation operations may include preserving the software required to interact with the data or emulating older systems, migrating data to new formats and new media, and ensuring there is sufficient metadata to understand, interpret, manage and preserve the data.
privacy	The protection of personal information from unauthorized access by others.
proprietary	A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished, and can be changed by the company at any time. Proprietary software usually reads and saves data in its own proprietary format. For example, different versions of Microsoft Excel use the proprietary XLS and XLSX formats. (Open Data Handbook)
public access policy	Public access policies ensure that the results of research are freely available to the public. This term is generally used by funders to refer to policies that align with the objectives of the OSTP memo, “Increasing Access to the Results of Federally Funded Scientific Research.” (Cornell University, 2025; SPARC, 2026).
repository	A facility that manages the appraisal and preservation of, and access to, materials on a long-term or permanent basis. Learn more about sharing and archiving data as well as data curation services that support transferring data to institutional repositories.
restricted data or restricted access data	Data which are made available under stringent, secure conditions. Typically confidential or sensitive data.
security	Methods of protecting data from unauthorized access, modification, or destruction.
standards	Accepted methods or models of practice; these may be formally approved (as in NISO standards), or de facto standards. In the context of data management, standards typically apply to data or file formats, and to metadata.
tabular data	Data that appear in a table format, organized with specific relationships between columns and rows.
trade secret	Confidential information of commercial value. Trade secrets are exempted from the OMB’s definition of data (1999).
usage statement	An expression of the conditions under which a data set may be used. May be formal, as in a license or contract, or an informal expression of the preferences of the data owner(s).
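
The distinction between “coded data” and “de-identified data” defined above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as a recommended procedure: the function name code_records, the field names, and the sample participants are all hypothetical, and the sketch handles only direct identifiers, whereas real de-identification must also address indirect identifiers.

```python
def code_records(records, id_fields, prefix="ST"):
    """Split raw records into coded data and a separate linkage table.

    Hypothetical helper for illustration: each record receives a study ID
    (ST01, ST02, ...), and the direct identifiers named in id_fields are
    moved out of the dataset into the linkage table.
    """
    coded, linkage = [], []
    for n, record in enumerate(records, start=1):
        study_id = f"{prefix}{n:02d}"
        # The linkage table connects the study ID to the direct identifiers.
        linkage.append({"study_id": study_id,
                        **{field: record[field] for field in id_fields}})
        # The coded dataset keeps the study ID and the non-identifying fields.
        coded.append({"study_id": study_id,
                      **{k: v for k, v in record.items() if k not in id_fields}})
    return coded, linkage


# Made-up sample data; not drawn from any real study.
records = [
    {"name": "Participant A", "email": "a@example.org", "score": 7},
    {"name": "Participant B", "email": "b@example.org", "score": 4},
]

# Coded data: identities are recoverable only through the linkage table.
coded, linkage = code_records(records, id_fields=("name", "email"))

# De-identified data (in the sense of this glossary): the codes are removed
# from the dataset, or equivalently the linkage table is destroyed.
deidentified = [{k: v for k, v in row.items() if k != "study_id"} for row in coded]
```

In this sketch, coded holds rows such as study ID ST01 with a score, linkage holds the corresponding names and email addresses, and deidentified holds the scores alone. In practice the linkage table would be stored separately from the coded dataset and under stricter access controls.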

More comprehensive lists of terms