README, Write Me! Digital Project Organization and Documentation

Overview

This workshop covers best practices for organizing and documenting your digital projects for robust, reproducible research. Topics covered include data documentation, forms of metadata at both the data and project level, and best practices for organizing file directories and recording metadata. After this workshop, learners will be able to create a README for their data-driven project directory, design a data dictionary (or codebook) for a dataset, and describe how to use a workflow diagram to represent their data gathering, cleaning, and analysis methodology. This workshop includes a studio portion where learners are guided through putting principles into practice to begin documenting their own research projects.

Learning Objectives

After this workshop, learners should be able to:

  • Define metadata
  • Describe the importance of collecting metadata / data documentation
  • Identify the different ways metadata can be recorded at the project and dataset level
  • Locate a relevant metadata standard for their field
  • Produce a data dictionary / codebook for a tabular dataset
  • Document the organization of a data-driven project using a README
  • Represent how their data is processed and analyzed using a workflow diagram

Background: Data Documentation

Why document?

Detailed documentation of datasets and data analysis workflows helps other researchers, including collaborators, find and use your data more easily. Strong documentation also enables reuse and reanalysis of your data, which increases research reproducibility and scientific integrity (NASEM 2019).

Consistent documentation practices will also help you find, use, and re-analyze your own data in the future. Have you ever put a project on the back burner for a few months, only to return and be confused as to what was in which dataset, which dataset file was up to date, or which analyses you used? Just as hardware fails and information can be lost over time (aka bit rot), we are only human and can’t be expected to remember everything. The diagram below illustrates how information about datasets or projects is lost over time as researchers become less and less familiar with the inner workings of a particular project. While the figure frames these issues across a researcher’s career, information is also lost during the shorter tenure of our graduate and post-doctoral careers as we juggle multiple projects, often move between machines and locations, and lose space in our “mental storage.” Keeping well-documented data and projects is an antidote to these issues.

(from Michener et al 1997)

Documenting your data and project consistently saves you time and avoids headaches. It also importantly increases data accessibility, transparency, and reproducibility in your field and research at large. Having robust documentation workflows is key to scientific integrity in the digital age.

Defining Metadata

Metadata, as its name suggests, is data about your data. It is information about information. More specifically, the National Information Standards Organization defines metadata as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.” Whenever we document our processes, describe the contents of our datasets and variables, or record analysis details, we are collecting metadata.

Metadata in the Data Lifecycle

(Image from USGS with modifications.)

When considering the data lifecycle, it is apparent that data documentation should be created and continually updated throughout all stages of a research project. We should consider how we will collect metadata when planning our data-driven projects, record metadata alongside our actual data collection, and describe our data processing and analysis workflows. We also need to document our plans for the long-term storage of our data and include related publications, reports, or archives along with our dataset for future users.

(To learn more about the data lifecycle, see this resource.)

What should be included in metadata?

Metadata can be broken down into three general categories: descriptive metadata (who created or collected the data and when, the subject of the data, and what variables it measures and includes); structural metadata (how the datasets or projects are organized, if and how datasets relate to each other, and how they were processed or analyzed); and administrative metadata (contact information for all data and/or code originators, criteria for access and re-use rights, and where the data are shared, published or archived).

Minimally, your metadata and associated project documentation should be organized to easily answer the following questions:

Descriptive

  • How, where, and about what was this data collected? What processes were used for creating the data?
  • What do the values in the table(s) refer to?
  • What are the units of the variables? What possible levels or range can they have?
  • How is missing data denoted?

Structural

  • What is the file naming convention or scheme in my project directory?
  • What were the criteria for including the data in downstream analyses?
  • How was the data analyzed?
  • What software do I need in order to read the data?
  • What is the process for updating the data going forward?

Administrative

  • Where can the data be accessed, and by whom? Can I give these data to someone else?
  • Why were the data created? Who were the data created for? Are there any funders to acknowledge?
  • How should the data be cited?
  • Where can I find relevant publications?
  • Who should people contact if they have questions about the data?

More questions that can help you decide what to include in your metadata and overall project documentation can be found here: Metadata Questions in Plain Language (USGS).

Tools for planning and documenting your project

Data documentation that accompanies your datasets and resides within your project directory is important and is the main focus of this workshop. As you answer the above questions, you may find yourself considering decisions about how you should organize, store, and structure your data-driven project. You need a data management plan!

The DMPtool helps researchers create a Data Management Plan (DMP). Among the topics included in this plan are guidelines for how you will document and collect metadata in your project, as well as other questions related to structural and administrative metadata. Many funding agencies, including the National Science Foundation (NSF), now require DMPs for grant proposals. Creating a DMP can also signal your commitment to more open, reproducible research. DMPs can be included in study pre-registration, such as on the Open Science Framework, where researchers provide details as to methods, hypotheses, predictions, and data management before conducting a study. It’s never too late to create a DMP, but as with most projects, the earlier you start planning the better. Learn more about the DMPtool and DMP resources from the UC Davis Library here.

Levels of Metadata

In this workshop, we will focus on documenting metadata at three levels within a research project:

  1. Project level: File directory contents and organization
  • Documentation format: README file
    • How is your project organized in folders and files?
    • What are the file naming conventions?
    • What is contained in each of the separate files?
  2. Data level: Variables and units
  • Documentation format: Data dictionary and/or codebook
    • What are the variables in this dataset?
    • What are the possible levels or values, and units for each variable?
    • How are missing values denoted?
  3. Project level: Workflow processes
  • Documentation format: Workflow diagram
    • How was data cleaned, processed and/or analyzed?
    • What code or scripts were used to do this?
    • How do different datasets relate to one another?

These three topics cover aspects of descriptive metadata (1-2) at various levels, and structural metadata (3) at the project level. We go into detail on each and provide templates and examples as a starting place below.

Note: You may encounter various forms and organizations of metadata “in the wild.” We have organized this workshop in the above format to align with the complex nature of most research projects, which often contain multitudes of datasets, code files, and outputs. A single readme.txt document can suffice for smaller research projects that involve only one dataset and only a few code files and outputs. Use the structure that meets the needs of your project and how you plan to use your data.

Project: File Directory Contents and Organization

At the highest level of your project is your project directory, or the folder(s) containing all the files of your project. All of your data, code, figures, and even manuscript drafts should be under the umbrella of the project directory.

READMEs

A README is a document that describes the organization and contents of your project, and serves as a roadmap to orient you - and anyone else who “drops in” to your project.

At a bare minimum, your README should include:

  • Project title, investigator, institution and dates of project
  • Brief description of methods for data collection and generation (can include links to publications, databases, protocols, or other documentation with more details)
  • Brief description of data processing, if data are filtered or generated from the raw data
  • Brief description of what each folder and data file contains
  • Date each file was created
  • Overview of the file naming scheme (if present, see below)
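
If a fill-in-the-blank starting point helps, here is a minimal sketch (Python assumed; every project detail shown is a hypothetical placeholder) that writes a plain-text README skeleton containing these minimum fields:

```python
# Sketch: write a plain-text README skeleton covering the minimum fields above.
# All project details are hypothetical placeholders -- edit them to fit your project.
from datetime import date
from pathlib import Path

readme_text = f"""PROJECT TITLE: <project title>
INVESTIGATOR: <name, institution>
PROJECT DATES: <start date> to <end date>
README LAST UPDATED: {date.today().isoformat()}  (ISO 8601, YYYY-MM-DD)

DATA COLLECTION / GENERATION
<brief description of methods; link to protocols, publications, or databases>

DATA PROCESSING
<brief description of how processed or filtered files were derived from the raw data>

FOLDER AND FILE CONTENTS
data/raw/        <what it contains, date created>
data/processed/  <what it contains, date created>
code/            <what it contains, date created>
results/         <what it contains, date created>

FILE NAMING SCHEME
<e.g., [YYYYMMDD]_[method]_[subject]_[version].[ext]>
"""

Path("README.txt").write_text(readme_text)
```

Fill in each placeholder (and keep the file updated) so the README stays an accurate roadmap to the project.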

Best practices for README files:

READMEs should

  • Be written in plain-text (.txt) or markdown (.md) so they are easily accessible by anyone (use a non-proprietary file format)
    • For more information on markdown, see this guide
  • Live in the main directory of the project folder or file they describe
  • Be named so that it is clear what project / file it describes
  • Be formatted similarly across different projects
  • Use standardized formats, such as ISO 8601 for dates (YYYY-MM-DD)

For more information on README contents beyond these minimum requirements, as well as more format recommendations, see Cornell University’s guide to writing “readme” style metadata.

For some README inspiration, see these real-world examples:

You’ll notice substantial variation in the formats and topics covered, but the same core information needed for data reuse appears in each.

Project Directory Structure

As you begin your README documentation, you may want to consider the organization of your project and directory. Do you have a consistent file naming convention or scheme? Is there a logical hierarchy to your folders?

Directory structure tips: While folder directories will vary across project types and researcher styles, there are some tried-and-true tips that can help with your project organization. It’s best to include all files for your project in one single project folder, with subfolders. Raw, original data files can often be stored separately from processed or filtered datasets. Results or products (such as statistics, summary tables, or figures) can be stored in separate folders. Code and scripts can have their own folder, too. In large projects, you might also want to have an “archive” folder, where you store zipped folders of past code scripts or data, so they can be accessed again but are separate from the rest of the working directory. Whatever you choose, make sure there is some logic to the organization (this will come through when you explain it in your documentation) and that you are consistent.
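
As an illustration, here is a minimal sketch (Python assumed; the project and folder names are hypothetical examples) of scaffolding this kind of single-project-folder layout:

```python
# Sketch: scaffold one project folder with commonly used subfolders.
# The project and folder names are hypothetical -- adapt them to your own work.
from pathlib import Path

project = Path("my_project")  # hypothetical project name
subfolders = [
    "data/raw",         # original, untouched data files
    "data/processed",   # filtered or cleaned datasets
    "code",             # scripts and notebooks
    "results/figures",  # figures and plots
    "results/tables",   # summary statistics and tables
    "archive",          # zipped past versions, kept out of the working directory
]

for folder in subfolders:
    (project / folder).mkdir(parents=True, exist_ok=True)

# A placeholder README at the top level documents the structure from day one.
(project / "README.txt").touch()
```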

This diagram shows an example of two ways you can organize a directory: 1) by workflow step, such as data files, code files, and outputs; or 2) by output, such as a results table or figure, where all the data and code required for that output are organized in a sub-directory. (Source: Library Carpentry)

Note: For researchers in STEM or other computational fields, GitHub repositories are a great place to get a sense of the organizational norms for your field. Often, a folder directory will be housed in a version-controlled repository, so repositories can serve as inspiration for how to structure your folder directories. You’ll often see a data folder, a code folder, an image or figure folder, etc. within a repository. For more on what a repository is and on version control with git, see this past workshop on version control.

File Names

File names are one of the first places many of us store metadata. To better organize your project directory and document your metadata, we recommend using a consistent naming scheme. You can then record the general form of this file naming scheme in your README document, so that future users can easily navigate current files and generate new files that match the format.

Some possible schemes:

  • [YYYYMMDD]_[method]_[subject]_[authorLastName]_[version].[ext]
  • [YYYYMMDD]_[project-name]_[method]_[version].[ext]
  • [YYYYMMDD]_[version]_[subject]_[authorInitials].[ext]
  • [subject]_[author]_[YYYYMMDD].[ext]
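
As a sketch of how the first scheme might be applied programmatically (Python assumed; the field values below are hypothetical):

```python
# Sketch: assemble a file name following
# [YYYYMMDD]_[method]_[subject]_[authorLastName]_[version].[ext]
# All field values are hypothetical.
from datetime import date

def make_filename(method, subject, author, version, ext, when=None):
    """Build a file name with an ISO-style date prefix and underscore-separated fields."""
    when = when or date.today()
    return f"{when.strftime('%Y%m%d')}_{method}_{subject}_{author}_{version}.{ext}"

print(make_filename("rnaseq", "liver", "smith", "v02", "csv"))
# e.g. 20240115_rnaseq_liver_smith_v02.csv (the date reflects the day you run it)
```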

And an example file name scheme in practice, from a bioinformatics study:

(from Reiter et al 2021)

As you can see, as file names become more complex, using a consistent scheme and documenting it is KEY to data interpretability!

In terms of format, file names are best if they are human-readable, machine-readable, and work well with default ordering. Briefly, this means that you as a user should be able to glean some metadata from the file name, that the file name avoids spaces and special characters, and that it ideally includes a date or version number that can be used to sort files and versions. You can read more about these file-naming best practices here. As with file directories, there is no one-size-fits-all solution; you can tailor your names to suit your needs and field of study, and as long as you are consistent and document your naming scheme, you are doing it right!
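
A quick illustration of the “default ordering” point: because ISO-style date prefixes sort chronologically under plain alphabetical sorting, a default file listing stays in time order (the file names below are hypothetical):

```python
# Illustration: ISO-style date prefixes keep files in chronological order
# even under plain alphabetical sorting. File names are hypothetical.
filenames = [
    "20231104_survey_raw_v01.csv",
    "20230212_survey_raw_v01.csv",
    "20240530_survey_clean_v02.csv",
]
print(sorted(filenames))
# ['20230212_survey_raw_v01.csv', '20231104_survey_raw_v01.csv', '20240530_survey_clean_v02.csv']
```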

Datasets: Variables and Codes

At the next level, after project organization and folders, lie the details of our data and datasets. If you are working with file types such as images, photos, or videos, where the data lives in separate files, you may want to make a sub-folder README that describes the contents of the folder, the origin and subject of the files, and the file naming scheme or convention. Depending on your use and project complexity, you can also include this information in the larger project README, though often it is best to have the README file housed closely with the files it details.

Data Dictionaries / Codebooks

If you work with tabular or spreadsheet data, as many researchers do, you might need to collect metadata differently. In this case, there are arrays of variables, codes and entries that may need a key and may relate to one another. When working with this type of data, data dictionaries and codebooks can be useful.

Data dictionaries describe the variables of a dataset and their parameters, specifically for tabular data. Codebooks are very similar to data dictionaries; they are commonly used for survey data and may have a slightly different format, but the difference is mainly one of field nomenclature.

Bare minimum of what a data dictionary or codebook should include:

  • List of variables / column names and a description of what they include
  • For each variable, the type (numerical? character/text?) and the units it should be entered in (e.g. meters, feet)
  • For each variable, the acceptable range or list of values
  • If a variable uses codes, what the codes refer to
    • e.g., female = 1, male = 0; Likert scale: 5 = Strongly Agree, 4 = Agree, 3 = Neutral, 2 = Disagree, 1 = Strongly Disagree.
    • Codebooks usually go into great detail on these levels for extensive surveys and demographics.
  • How missing data is coded across the dataset or in specific variables
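
As a concrete sketch (Python assumed; the variables, units, codes, and file name are hypothetical examples), a minimal data dictionary saved as a .csv file might be generated like this:

```python
# Sketch: write a minimal data dictionary for a tabular dataset to a .csv file.
# The variables, units, codes, and file name are hypothetical examples.
import csv

rows = [
    # (variable, type, units, allowed_values, missing_code, description)
    ("site_id", "text", "", "A-F", "NA", "Field site identifier"),
    ("height_m", "numeric", "meters", "0-50", "NA", "Plant height"),
    ("sex", "integer", "", "0 = male, 1 = female", "-9", "Recorded sex of subject"),
    ("collect_date", "date", "", "YYYY-MM-DD", "NA", "Collection date (ISO 8601)"),
]

with open("plant_survey_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["variable", "type", "units", "allowed_values", "missing_code", "description"])
    writer.writerows(rows)
```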

Best practices for data dictionaries and codebooks:

Similar to READMEs, data dictionaries should:

  • Be stored in open, non-proprietary formats. Data dictionaries, like the data they describe, may be easily formatted in a tabular format. Some researchers using a spreadsheet manager like Excel might place them in a separate tab in the spreadsheet, but we encourage you to save this tabular data separately as either a .txt or .csv file. This will make the data more accessible to future users who may not have access to your proprietary software. You can also store the information in a list, non-tabular format in a stand-alone .txt or .md file, or within a larger README. Whatever you choose, make sure to be consistent!
  • Live in the same directory / folder as the file they describe
  • Be named so that it is clear what project / file they describe (e.g. [dataname]_dictionary.[ext] or [dataname]_documentation.[ext])
  • Be consistently formatted across different datasets within a project

Some examples of data dictionaries and codebooks:

This diagram illustrates how a data dictionary may relate to an actual dataset. Source

An example codebook from a larger survey. It explains what the numerical codes mean and how they may connect to codes used in other surveys. This codebook could be improved by further explaining the names / information stored in columns and indicating missing values. Source

More examples:

  • Extensive codebook for the General Social Survey national survey run by the University of Chicago

Project: Workflow Relationships

Once we have covered some of the descriptive metadata (the WHAT) detailing our project directory, file organization and our datasets, we can zoom out again and describe how it all relates, and how data is processed and analyzed. This structural metadata (the HOW) covers our workflow and processing/analysis “pipeline”.

Workflow and structural metadata can easily be included in your overall project README file. You can include a section that describes in text how datasets relate to each other, and how data was processed and analyzed from its raw form to create results, such as statistics tables and figures.

Some questions to consider when describing these processes:

  • How was data processed or filtered from the raw data? What methods were used, and if done programmatically using code, what are those scripts called?
  • What was your analysis workflow? What software or programming languages were required, for which datasets?
  • How were any results files, such as summary tables or figures, produced? What scripts, if any, were used?

Workflow Diagrams

If your analysis and data processes are more complicated than a single code script, or you are working with multiple datasets, it can be helpful to display this information graphically using a workflow diagram (also called a data-flow diagram). This can serve as a “map” laying out the workflow of your project, and can add a visual element to your project documentation. It’s a good idea to explain your workflow in text too, which keeps your process accessible in multiple modalities and helps you outline the process before creating a visual diagram.

Tips for creating a workflow diagram:

  • Use shapes to represent files, such as datasets, scripts, figures, tables or other output.
  • Connect shapes with arrows to indicate the flow of information through files from data input to result output.
  • Include a key if you use multiple types of shapes or colors for different file types (e.g., data vs. scripts)
  • Be consistent in the structure of workflow diagrams across projects
  • Save a version of your workflow diagram as an image, like .png or .jpg so that it can be accessed/viewed by others without proprietary software

These diagrams can be made in software like Microsoft PowerPoint or Google Slides, and even in basic image editors if you stick to standard shapes. Online tools like Lucidchart, Drawio, or others (list here) can also be optimal for creating data documentation workflows, and some even let you export the code needed to create a database based on a data structure mapped in that program. Stripped-down free and academic/educational-license versions of proprietary workflow software are often available and sufficient for most researchers, but can be limiting for highly complex projects.
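
If you would rather generate the diagram from code so it lives alongside your scripts and stays easy to update, here is a minimal sketch using the Python graphviz package (an assumption on our part; it requires both the package and the Graphviz system binaries, and the file names are hypothetical):

```python
# Sketch: draw a simple workflow diagram with the Python graphviz package.
# Assumes the graphviz package and Graphviz binaries are installed; file names are hypothetical.
from graphviz import Digraph

dot = Digraph(comment="Project workflow")

# Shapes distinguish file types: boxes for data, ellipses for scripts, notes for outputs.
dot.node("raw", "data/raw/survey_raw.csv", shape="box")
dot.node("clean_script", "code/01_clean.py", shape="ellipse")
dot.node("clean", "data/processed/survey_clean.csv", shape="box")
dot.node("analysis_script", "code/02_analysis.py", shape="ellipse")
dot.node("figure", "results/figures/fig1.png", shape="note")

# Arrows follow the flow of information from raw data through scripts to outputs.
dot.edge("raw", "clean_script")
dot.edge("clean_script", "clean")
dot.edge("clean", "analysis_script")
dot.edge("analysis_script", "figure")

dot.render("workflow_diagram", format="png", cleanup=True)  # writes workflow_diagram.png
```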

Some workflow diagram examples:

Metadata standards and schemas

In this workshop, we have discussed these simple forms of metadata to start the process of documenting your project in relatively straightforward, approachable ways. This is an important initial step for researchers to start good data documentation practices and make sure their data is interpretable and reusable.

As has become clear, however, the formats of this metadata can vary a lot between researchers and research groups (even when we include the minimum information for reuse). Often, as projects become more complex, add more collaborators, agencies, and institutions, or become affiliated with funding agencies or data repositories, more regimented metadata is needed. Researchers then often turn to standards-based and structured metadata.

Structured metadata often involves using specialized tools to create machine-readable, standardized metadata documents that align with a discipline-specific format. These documents are often written in their own special format, such as eXtensible Markup Language (XML), and are sometimes called metadata schemas. Researchers format these metadata documents to align with discipline-specific metadata standards set by agencies, data centers, or common data repositories in that field. For instance, Dublin Core is a relatively general metadata standard applicable to many fields, while the Data Documentation Initiative standard (https://ddialliance.org/) is widely used in the social and behavioral sciences. For spatial data, the standards are set by the Federal Geographic Data Committee.
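
To give a feel for what structured metadata looks like, the sketch below builds a tiny Dublin Core style record in XML using Python’s standard library (the record contents are hypothetical, and a real deposit would follow the full schema required by your repository or data center):

```python
# Sketch: build a minimal Dublin Core style XML record with the standard library.
# The record contents (title, creator, identifier, etc.) are hypothetical placeholders.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace
ET.register_namespace("dc", DC)

record = ET.Element("metadata")
for element, value in [
    ("title", "Example plant survey dataset"),
    ("creator", "Lastname, Firstname"),
    ("date", "2021-03-04"),
    ("description", "Plant heights and collection dates from hypothetical field sites."),
    ("identifier", "https://doi.org/10.xxxx/placeholder"),  # placeholder identifier
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

ET.ElementTree(record).write("dataset_metadata.xml", encoding="utf-8", xml_declaration=True)
```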

We encourage you to find the metadata standard that is appropriate for your discipline and your research so that you can learn more in the future. Here are some searchable lists of metadata standards by field:

Can’t find the metadata standards for your field or data type? Not sure where to start? UC Davis researchers can reach out to a librarian for help.

Principles to Practice

In the studio section of the workshop, we will begin fleshing out a README file and, if applicable, a data dictionary/codebook and workflow diagram for your own data-driven project.

We have put together templates to serve as a starting place. Please refer to the additional resources to see examples of the diversity of these documents as well as recommendations for more extensive documentation.

Let’s get documenting!

Templates

Examples of Documentation Using Templates

In the workshop demo, we gave an example of how to build out documentation (a README, a data dictionary, and a workflow diagram) using these template documents. The example dataset revolved around the Bechdel test (a test of whether women are well-represented in a film, from the original cartoon, “The Rule”, by Alison Bechdel), and whether movies that pass the Bechdel test have a higher or lower return on investment than movies that do not. This was an extension of the FiveThirtyEight analysis shown in this article (https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/) (accessed 2021-03-04). The original data set can be accessed at: https://data.world/fivethirtyeight/bechdel.

Additional Resources

UC Davis

  • Interested in the DMPtool, more structured metadata and metadata standards? The UC Davis library offers researchers consultations related to data management. Learn more here.

  • To learn more about keeping data tidy and tips and tricks in Microsoft Excel, see DataLab’s past workshop materials here.

  • For more information on version control and GitHub, see DataLab’s past workshop recording (Fall 2019) and materials from Winter 2021 here.

Data Used in Workshop

  • The data visualization for the in-workshop activity on the top dog breed comes from Information is Beautiful.

  • Data for the studio-section demo comes from FiveThirtyEight’s analysis of Hollywood movie budgets and representation of women. Datasets were sourced from data.world.