Introduction

Documentation and metadata are essential for understanding your data in detail, and they help other researchers find and reuse your data. You can provide information about the dataset's content, the context in which the data were captured or created, and the origin of the dataset. Documentation and metadata are also what make your data FAIR:

Findable - keywords, topics, DOI, controlled vocabularies, repository metadata standards, metadata exchange
Accessible - metadata is open, wherever possible, even if the data aren’t
Interoperable - standards, ontologies and controlled vocabularies
Reusable - describe your data so that other researchers can understand and reuse them

Metadata

Metadata is data about data. It plays an essential role in making your data FAIR. Metadata should be added to your research data continuously, not just at the beginning or end of a project. It can be added manually or automatically, preferably according to a disciplinary standard. From a FAIR perspective, metadata can matter even more than the data themselves: metadata should always be openly available, and it is what links research data and publications in the Internet of FAIR Data and Services. While data documentation is meant to be read and understood by humans, metadata (sometimes a part of the documentation) is primarily meant to be processed by machines.
There are three main types of metadata:

Administrative metadata - data about a project or resource that is relevant for managing it; for example, project/resource owner, principal investigator, project collaborators, funder, project period, etc. Administrative metadata is usually assigned before you collect or create the data.
Descriptive or citation metadata - data about a dataset or resource that allows people to discover and identify it; for example, authors, titles, abstracts, keywords, persistent identifiers, related publications, etc.
Structural metadata - data about how a dataset or resource came about and how it is internally structured. Structural metadata describes, for example, the unit of analysis, collection method, sampling procedure, sample size, categories, variables, etc. Structural metadata should be gathered by the researchers according to best practices in their research community and be published together with the data. Descriptive and structural metadata should be added continuously throughout the project.
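Because metadata is primarily meant to be processed by machines, it helps to keep it in a structured form from the start. The following is a minimal sketch in Python that groups a record by the three types above and writes it out as JSON. All field names and values are hypothetical placeholders and do not follow any particular standard.

```python
import json

# All field names and values below are hypothetical placeholders,
# not taken from any particular metadata standard.
metadata = {
    "administrative": {
        "project_owner": "Example University",
        "principal_investigator": "Jane Doe",
        "funder": "Example Funding Agency",
        "project_period": {"start": "2023-01-01", "end": "2025-12-31"},
    },
    "descriptive": {
        "title": "Example survey dataset",
        "authors": ["Jane Doe", "John Smith"],
        "keywords": ["survey", "example"],
        "persistent_identifier": None,  # e.g. a DOI, once one is assigned
    },
    "structural": {
        "unit_of_analysis": "individual",
        "collection_method": "online questionnaire",
        "sample_size": 250,
        "variables": ["respondent_id", "age", "satisfaction"],
    },
}


def missing_fields(record):
    """Return 'group.field' names whose value is still empty."""
    return [
        f"{group}.{field}"
        for group, fields in record.items()
        for field, value in fields.items()
        if value in (None, "", [])
    ]


# Serialise the record so that other tools can read it.
metadata_json = json.dumps(metadata, indent=2, ensure_ascii=False)
print(metadata_json)
print("Still to fill in:", missing_fields(metadata))
```

Running a check like `missing_fields` regularly during the project is one way to add metadata continuously rather than only at the end.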

An excellent way to determine what metadata to capture is to think about the information someone else would need to understand your project and reuse the data. Many disciplines, repositories, or data centres use metadata standards or schemas. A metadata standard is a defined set of metadata fields that can be general or discipline-specific.

Metadata Standards

There are many metadata standards: general, discipline-specific, and institutional. Dublin Core and DataCite are widely used general-purpose standards, and the Data Documentation Initiative (DDI) is also common. Other standards are tied to particular fields or institutions, e.g., DC (life sciences), EML (ecology), SDMX (ECB, EUROSTAT, IMF, OECD, UN), SAFE (ESA), INSPIRE ISO 19139 (earth sciences), Project Open Data Metadata Schema v1.1 (US federal agencies), TEI and CDWA (humanities disciplines). Metadata standards specific to various disciplines can be found on the Digital Curation Centre (DCC) website.
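To give a sense of what a record in a general standard looks like, here is a small sketch that writes a few Dublin Core elements as XML using Python's standard library. The element names (title, creator, subject, description, date, type, rights) and the namespace belong to the Dublin Core element set; the dataset values and the `metadata` wrapper element are assumptions made for illustration, and a repository will usually define its own envelope.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical dataset description; keys are Dublin Core element names.
record = {
    "title": "Example survey dataset",
    "creator": ["Jane Doe", "John Smith"],
    "subject": ["survey", "example"],
    "description": "Responses to a hypothetical satisfaction survey.",
    "date": "2024-05-01",
    "type": "Dataset",
    "rights": "CC BY 4.0",
}


def to_dublin_core(fields):
    """Build a simple XML tree with one dc:* element per value."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = item
    return root


dc_xml = ET.tostring(to_dublin_core(record), encoding="unicode")
print(dc_xml)
```

Repeatable elements such as creator and subject are written once per value, which is how Dublin Core expresses multiple authors or keywords.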

Documentation

Documentation is also needed to describe the data correctly. It covers the contextual and descriptive features of the data and all the information someone needs to understand and use them. Documentation matters at the dataset level (e.g. describing how the data were created) and also at the level of individual data elements (e.g. explaining what each variable means, or the parameters used to generate data files such as images). Examples: protocols, a codebook (with an explanation of concepts, names, variables, and abbreviations), lab journals, comments explaining code within a file, methodological information, and information about the structure of a dataset.
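A codebook can also be kept in a form that a script can check against the data. The sketch below assumes a hypothetical codebook and a tiny in-memory CSV file, and reports columns that appear in the data but have no codebook entry. The variable names and the layout of the codebook are illustrative only.

```python
import csv
import io

# Hypothetical codebook: one entry per variable in the dataset.
codebook = {
    "respondent_id": {"label": "Anonymous respondent number", "type": "integer"},
    "age": {"label": "Age at time of survey", "type": "integer", "unit": "years"},
    "satisfaction": {
        "label": "Overall satisfaction",
        "type": "integer",
        "values": {1: "very dissatisfied", 5: "very satisfied"},
    },
}

# Stand-in for a real data file; the column 'dept' is deliberately undocumented.
sample_csv = "respondent_id,age,satisfaction,dept\n1,34,4,HR\n2,51,5,IT\n"


def undocumented_columns(csv_text, book):
    """Return column names in the data that have no codebook entry."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [column for column in header if column not in book]


print(undocumented_columns(sample_csv, codebook))
```

Here the check flags `dept`, which tells you the codebook needs one more entry before the dataset is shared.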

Readme

A README file is a plain-text file with descriptive information, commonly used for software, games, and code. It is a supplementary document in which the creator explains the contents to the user. When working with data, it is useful to create a README file and include it with your data. This helps future users understand the data, the terms used, and more.

A README file is a simple way to create documentation for a dataset:

it should make your data understandable and usable,
it should provide the context of the data, i.e. which research project the data belong to and how they should be interpreted,
it should allow users to understand how a dataset relates to others.

The README file must be stored in the same location as the data it describes. Its name must make clear which file or dataset it describes. The location of the README file in the folder structure can also show which dataset it belongs to. You can also provide one README file for an entire research project.

The README file should describe the context in which the data were gathered, the origin of the data, and the dataset's content, so that the person who opens the file understands what the dataset is about. You should also include any technical information needed to open the dataset, for example, the required software or specifications about file formats. Explain the dataset's structure and what any abbreviations mean. Information about rights to the dataset and about the confidentiality of the data is best stated explicitly in the README file, because the person opening the dataset must be aware of the legal implications of its use.
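The points above can be turned into a reusable README skeleton. The following sketch generates a plain-text template in Python. The section titles, their order, and the placeholder wording are assumptions based on the items listed above, not a required format.

```python
# Section titles follow the points listed above; wording and order are assumptions.
README_SECTIONS = [
    ("Dataset title", "Name of the dataset and the files this README describes."),
    ("Context", "Research project the data belong to and how the data were gathered."),
    ("Origin", "Where the data come from."),
    ("Content and structure", "What the dataset contains, how it is organised, what abbreviations mean."),
    ("Technical information", "Software or file-format details needed to open the data."),
    ("Rights and confidentiality", "Licence, usage restrictions, and confidentiality of the data."),
]


def build_readme(sections, answers=None):
    """Return README text; sections without an answer get a TODO placeholder."""
    answers = answers or {}
    lines = []
    for title, hint in sections:
        lines.append(title.upper())
        lines.append(answers.get(title, f"[TODO: {hint}]"))
        lines.append("")  # blank line between sections
    return "\n".join(lines)


readme_text = build_readme(README_SECTIONS, {"Dataset title": "Example survey dataset"})
print(readme_text)
```

To use it, write `readme_text` to a file such as `README.txt` next to the data and replace each TODO as the project progresses.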

Written by Anna Wałek

Ph.D., President of IATUL, Open Science and RDM expert, ACC Cyfronet AGH

How do you adequately describe your research output
