
Research Data Management Toolkit: Documentation and Metadata

This guide provides information about research data management and the Tropical Data Hub (TDH) Research Data repository

Introduction: Documentation and Metadata

Documenting your research study and data is required at various stages in the Research Data Management Lifecycle. For example, you will need some form of documentation:

  • to prepare and update a data management plan
  • during the project in order to stay organized
  • on completion for your own recollection and re-use, and for sharing informally with colleagues
  • to include in reports to funders, technical reports, working papers or publications
  • before archiving data to ensure it is preserved correctly
  • in order to publish or share your data

You will need to create metadata (a subset of documentation) if you plan to publish or share your data. 

 Metadata ensures your data is discoverable and that others can interpret/validate, re-use and cite it correctly.

Good metadata includes data-level documentation as well as study-level documentation - it should not just describe the project or a publication.

 Data descriptions

   The description is the most important part of your data record. It should cover:

  • Why the data was collected, to provide context for your data
  • What the dataset consists of, for example:
    • types of files (e.g. transcripts, recordings, spreadsheets), file structures, formats and any relationships between them
    • data variables, e.g. how they are coded and their units. This information could be included in a codebook, README.txt or other supporting documentation (and loaded as an attachment), or embedded in the data itself - see the 'Where is Metadata Stored' box on this page
  • How the data was collected and processed, for example:
    • how subjects were selected/rejected, assigned to treatments/controls, how instruments were calibrated, how measurements were taken
    • how data was processed, analysed, cleaned etc. (R scripts can be loaded as attachments)
   If your description lacks sufficient detail, a data librarian will contact you for further information before the record is published.




Study-level documentation

Study-level documentation for data is often included in data management plans and provides a high-level overview and context for the data. It is an important component of the metadata used to describe data and is key to enabling secondary users to make informed use of shared data. Some systems integrate DMPs and metadata collection so that researchers don't have to re-enter this data!

Together, study-level and data-level documentation answer the why, how, when and who questions for your data. They overlap to some extent, but good study-level documentation would include:

  • the context of the project: its history and funding, aims, objectives, hypotheses, spatial and temporal coverage etc.
  • personnel: creators, data owners (IP) and custodians, roles and responsibilities and contact details
  • data collection methods: protocols, sampling design, work flows, instruments, hardware and software used 
  • subject descriptions: keywords, Fields of Research, Socio-Economic Objective codes, discipline-based vocabulary terms
  • structure of data files and the relationships between them
  • quality assurance: calibration, validation, cleaning or other QA processes carried out on data files
  • data provenance: origin and history of the data, use of existing datasets, modifications made over time and identification of different versions
  • access: conditions for access and use or details regarding data confidentiality, licensing arrangements
  • references to publications or other research outputs
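
As a rough illustration, the study-level items listed above could be captured in a simple structured record and checked for gaps before submission. This is only a sketch: the field names and placeholder values are assumptions made for the example and do not reflect the schema of the TDH Research Data repository or any other system.

```python
import json

# All field names and values are illustrative placeholders,
# not the schema of any particular repository.
study_record = {
    "title": "Example study title",
    "context": {
        "aims": "Placeholder statement of aims",
        "funding": "Placeholder funder and grant",
        "spatial_coverage": "Placeholder region",
        "temporal_coverage": {"start": "2020-01-01", "end": "2020-12-31"},
    },
    "personnel": [
        {"name": "Placeholder Researcher", "role": "creator",
         "contact": "researcher@example.org"},
    ],
    "methods": "Placeholder summary of protocols, sampling design and instruments",
    "subjects": ["placeholder keyword"],
    "file_structure": "Placeholder note on files and how they relate",
    "quality_assurance": "Placeholder note on calibration, validation and cleaning",
    "provenance": "Placeholder note on origin, modifications and versions",
    "access": {"conditions": "Placeholder access conditions",
               "licence": "Placeholder licence"},
    "related_outputs": [],
}

# One top-level field per bullet in the study-level list (plus a title).
REQUIRED = [
    "title", "context", "personnel", "methods", "subjects",
    "file_structure", "quality_assurance", "provenance",
    "access", "related_outputs",
]


def missing_fields(record, required=REQUIRED):
    """Return required top-level fields that are absent, None or an empty string."""
    return [key for key in required
            if key not in record or record[key] in ("", None)]


def to_json(record):
    """Serialise the record as indented JSON suitable for a plain-text attachment."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Calling `missing_fields(study_record)` before saving the output of `to_json(study_record)` gives a quick check that no heading from the list has been overlooked.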

Data-level documentation

While it may be tempting to stop at the study level, metadata needs to include data-level documentation, as this is critical for validating, reproducing and re-using data. It could include (as applicable):

  • names, labels and descriptions for variables
  • definitions of codes and classification schemes used
  • definitions of specialised terminology or acronyms
  • codes and reasons for missing values (see also the Data Wrangling section of the Toolkit)
  • code and scripts used to derive data after collection (simple derivations such as grouping by age levels can be explained in variable and value labels)
  • weighting and grossing variables created
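
One possible way to record several of the items above (variable names and labels, units, code definitions and missing-value codes) is a small codebook written out as CSV. The sketch below assumes two invented variables, `age_grp` and `temp_c`; the names, codes and column headings are hypothetical and should be replaced with those of your own dataset.

```python
import csv
import io

# Hypothetical variables for illustration only.
CODEBOOK = [
    {
        "name": "age_grp",
        "label": "Age group of respondent",
        "units": "",
        "codes": {"1": "18-34", "2": "35-54", "3": "55+"},
        "missing": {"-9": "Not stated"},
    },
    {
        "name": "temp_c",
        "label": "Water temperature",
        "units": "degrees Celsius",
        "codes": {},
        "missing": {"-99": "Instrument failure"},
    },
]


def format_codes(mapping):
    """Flatten a code-to-meaning mapping into one readable cell."""
    return "; ".join(f"{code} = {meaning}" for code, meaning in mapping.items())


def codebook_to_csv(entries):
    """Return the codebook as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable", "label", "units", "codes", "missing_values"])
    for entry in entries:
        writer.writerow([
            entry["name"],
            entry["label"],
            entry["units"],
            format_codes(entry["codes"]),
            format_codes(entry["missing"]),
        ])
    return buffer.getvalue()


def decode(entries, variable, raw_value):
    """Translate a raw coded value into its documented meaning, if any."""
    for entry in entries:
        if entry["name"] == variable:
            if raw_value in entry["missing"]:
                return f"missing ({entry['missing'][raw_value]})"
            # Uncoded values (e.g. measurements) are returned unchanged.
            return entry["codes"].get(raw_value, raw_value)
    raise KeyError(f"{variable} is not documented in the codebook")
```

The CSV text can be saved alongside the data as supporting documentation, and `decode` shows how a secondary user might rely on the same codebook to interpret coded and missing values.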

Data-level documentation may also be embedded in the data itself as explained in the 'Where is Metadata Stored' box on this page.

Source: These guidelines have been adapted from material prepared by the UK Data Service and listed in the Resources section of this page.

The terms "data documentation", "data provenance" and "data lineage" are often (understandably) confused. Definitions vary, but they could be considered as a continuum, with data documentation at the broadest level. According to the RDA Research Data Provenance IG, provenance is concerned with questions of data origins, maintenance of identity through the data lifecycle, and how we account for data modification. The Data Wrangling Handbook v0.1 likens this to the chain of custody in criminal investigations, e.g. previous owners have to be identified and held accountable for the processing and cleaning operations they have performed on the data! Technical data lineage relies on metadata that tracks data flows at the lowest level: tables, scripts, statements etc.

Where is Metadata Stored?

Metadata can be stored in local systems with the data it is about - or in data or metadata stores when it is complete. The Tropical Data Hub (TDH) Research Data repository is an example of an institutional metadata store and contains records for datasets generated by JCU HDR students and researchers. Metadata records are harvested regularly and published by Research Data Australia. The TDH repository also provides secure storage for datasets which (unless restricted) are accessed directly via the catalogue or by negotiation with the data custodian.

Data-level documentation/metadata such as workflows, detailed methodologies, variable descriptions, codes and units are often stored with the data (embedded) or included in their own data file e.g. codebook, README text etc. as supporting documentation. 

Embedded documentation can be as simple as a key in a MS Excel spreadsheet (an additional worksheet) or it may be more complex, e.g. for software packages that include facilities for data annotation such as variable attributes, table relationships etc. If possible, export this as plain text and include it with your supporting documentation.
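
A minimal sketch of the plain-text export suggested above is shown here. It assumes the key has already been read out of the spreadsheet or software package as (variable, description) pairs; the variable names, descriptions and function names are invented for the example.

```python
# Hypothetical key, e.g. copied from an additional worksheet.
KEY_ROWS = [
    ("site_id", "Placeholder site identifier"),
    ("temp_c", "Water temperature in degrees Celsius"),
]


def render_key(rows, title="Key to variables"):
    """Render (variable, description) pairs as an aligned plain-text block."""
    width = max(len(name) for name, _ in rows)
    lines = [title, "=" * len(title)]
    for name, description in rows:
        lines.append(f"{name.ljust(width)}  {description}")
    return "\n".join(lines) + "\n"


def undocumented(columns, rows):
    """List data columns that have no entry in the key."""
    documented = {name for name, _ in rows}
    return [column for column in columns if column not in documented]
```

The text returned by `render_key(KEY_ROWS)` could be saved as a README file, while `undocumented` offers a simple check that every column in a data file has a description.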


Some of these resources have been developed to assist researchers in the social sciences. Data documentation is (arguably) more difficult to complete in some disciplines than others (HASS vs. STEM) but it is critical for trust, credibility and reproducibility in any research area:

This ANDS webinar covers data provenance, the Data Documentation Initiative (DDI) and the C2Metadata Project (automates capture of metadata describing variable transformations) being undertaken at ICPSR. Quite technical but may be of interest to social scientists and data managers (54 min.)


We acknowledge the Australian Aboriginal and Torres Strait Islander peoples as the first inhabitants of the nation and acknowledge Traditional Owners of the lands where our staff and students live, learn and work.