Documenting your research study and data is required at various stages in the Research Data Management Lifecycle. For example, you will need some form of documentation:
You will need to create metadata (a subset of documentation) if you plan to publish or share your data.
Metadata ensures your data is discoverable and that others can interpret/validate, re-use and cite it correctly.
Good metadata includes data-level documentation as well as study-level documentation - it should not just describe the project or a publication.
The description is the most important part of your data record. It should cover:
If your description lacks sufficient detail a data librarian will contact you for further information before publishing it.
Study-level documentation for data is often included in data management plans (DMPs) and provides a high-level overview and context for the data. It is an important component of the metadata used to describe data and is key to enabling secondary users to make informed use of shared data. Some systems integrate DMPs and metadata collection so that researchers don't have to re-enter this information!
Together, study-level and data-level documentation answer the why, how, when and who questions for your data. They overlap to some extent, but good study-level documentation would include:
While it may be tempting to stop at the study level, metadata needs to include data-level documentation, as this is critical for validating, reproducing and re-using data. It could include (as applicable):
Data-level documentation may also be embedded in the data itself as explained in the 'Where is Metadata Stored' box on this page.
Source: These guidelines have been adapted from material prepared by the UK Data Service and listed in the Resources section of this page.
The terms "data documentation", "data provenance" and "data lineage" are often (understandably) confused. Definitions vary, but they could be considered as a continuum, with data documentation at the broadest level. According to the RDA Research Data Provenance IG, provenance is concerned with questions of data origins, maintenance of identity through the data lifecycle, and how we account for data modification. The Data Wrangling Handbook v0.1 likens this to the chain of custody in criminal investigations, e.g. previous owners have to be identified and held accountable for the processing and cleaning operations they have performed on the data! Technical data lineage relies on metadata that tracks data flows at the lowest level - tables, scripts, statements etc.
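The chain-of-custody idea can be illustrated with a minimal sketch in Python. It assumes a simple in-memory list as the lineage log; the function names, field names and sample data are hypothetical and are not drawn from the RDA or the Data Wrangling Handbook. Each processing step records who performed it, when, and a checksum of the data before and after, so that a later user can trace how the data was modified.

```python
import hashlib
from datetime import datetime, timezone


def checksum(rows):
    """Return a SHA-256 fingerprint of the rows so each version can be identified."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def apply_step(rows, func, description, operator, log):
    """Run one processing step and append a lineage entry describing it."""
    result = func(rows)
    log.append({
        "step": description,
        "operator": operator,  # who is accountable for this step
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": checksum(rows),
        "output_sha256": checksum(result),
    })
    return result


# Hypothetical raw data: (site, temperature) pairs, one with a missing value
raw = [("A", 21.5), ("B", None), ("C", 19.0)]
lineage_log = []

cleaned = apply_step(
    raw,
    lambda rows: [r for r in rows if r[1] is not None],
    "Drop rows with missing temperature",
    "researcher_1",
    lineage_log,
)
converted = apply_step(
    cleaned,
    lambda rows: [(site, temp + 273.15) for site, temp in rows],
    "Convert temperature from Celsius to Kelvin",
    "researcher_2",
    lineage_log,
)

for entry in lineage_log:
    print(entry["step"], "by", entry["operator"])
```

Because the output checksum of one step matches the input checksum of the next, the log forms an unbroken chain from the raw data to the final version.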
Metadata can be stored in local systems alongside the data it describes - or, once it is complete, in data or metadata stores. The Tropical Data Hub (TDH) Research Data repository is an example of an institutional metadata store and contains records for datasets generated by JCU HDR students and researchers. Metadata records are harvested regularly and published by Research Data Australia. The repository also provides secure storage for datasets which (unless restricted) are accessed directly via the catalogue or by negotiation with the data custodian.
Data-level documentation/metadata such as workflows, detailed methodologies, variable descriptions, codes and units is often stored with the data (embedded) or included in a separate file, e.g. a codebook or README text file, as supporting documentation.
Embedded documentation can be as simple as a key in an MS Excel spreadsheet (an additional worksheet) or it may be more complex, e.g. for software packages that include facilities for data annotation such as variable attributes, table relationships etc. If possible, export this as plain text and include it with your supporting documentation.
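As an illustration of plain-text supporting documentation, the following Python sketch writes variable descriptions, units and codes to a tab-separated codebook. The variable names, descriptions and column layout are hypothetical examples and not a required format; adapt them to your own data and discipline.

```python
import csv
import io


def write_codebook(variables, out):
    """Write one row per variable to a tab-separated plain-text codebook."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["variable", "description", "units", "codes"])
    for var in variables:
        # Flatten coded values such as {0: "no", 1: "yes"} into "0=no; 1=yes"
        codes = "; ".join(
            f"{key}={label}" for key, label in sorted(var.get("codes", {}).items())
        )
        writer.writerow(
            [var["name"], var["description"], var.get("units", ""), codes]
        )


# Hypothetical variables, for illustration only
variables = [
    {"name": "site_id", "description": "Sampling site identifier"},
    {"name": "temp_c", "description": "Water temperature", "units": "degrees Celsius"},
    {"name": "bleached", "description": "Bleaching observed", "codes": {0: "no", 1: "yes"}},
]

buffer = io.StringIO()
write_codebook(variables, buffer)
codebook_text = buffer.getvalue()
print(codebook_text)
```

To save the codebook alongside the data, pass an open text file instead of the in-memory buffer, for example `open("codebook.txt", "w", newline="", encoding="utf-8")`.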
Some of these resources have been developed to assist researchers in the social sciences. Data documentation is (arguably) more difficult to complete in some disciplines than others (HASS vs. STEM) but it is critical for trust, credibility and reproducibility in any research area:
This ANDS webinar covers data provenance, the Data Documentation Initiative (DDI) and the C2Metadata Project (automates capture of metadata describing variable transformations) being undertaken at ICPSR. Quite technical but may be of interest to social scientists and data managers (54 min.)