6  Best Practices for Writing R Scripts

Note: Learning Goals

After this lesson, you should be able to:

  • Describe the advantages and disadvantages of using scripts
  • Develop modular scripts that serve as applications and libraries
  • Inspect and manage R’s software (package) environment
  • Describe and use R’s for, while, and repeat loops
  • Identify the most appropriate iteration strategy for a given task
  • Explain strategies to organize iterative code
  • Identify and explain the difference between R’s various printing functions
  • Read user input from the command line into R
  • Read data from configuration files into R
Important

We also use slides to teach this lesson.

This is a highly interactive lesson, so the reader might not have notes for everything we cover live.

An R script is a text file that contains R code. R scripts usually have the extension .R. You might already use R scripts in your workflow (for example, if you learned R from DataLab’s R Basics Workshop Series). Even if you do, this chapter covers many different practices you can adopt to use scripts more effectively. Although the chapter is focused on R, many of the practices also apply to scripting other languages.

This chapter is all about R scripts: what they are, when to use them, and how to write them, with specific emphasis on making sure your scripts are easy to read, extend, debug, and reuse. Topics covered along the way include package management, how to use iteration, how to print or log output, and how to read input from the command line or from configuration files. The chapter is organized around case studies (Section 6.3, Section 6.4) that benefit from many of the practices we recommend.

6.1 Tips for Projects

A well-organized project makes it easier to write well-organized scripts. This section briefly describes a few ways to keep your projects in order. The UC Davis Library’s Reproducible Research Guide provides many more tips for organizing research projects.

6.1.1 Create a Project Directory

Each time you start working on a new project, create a new directory on your computer. This is the project directory (also called a repository). Keep all of the files for the project in the project directory, so that you can easily find, share, and back up files. Use subdirectories with standard, descriptive names to organize the files. Document the project with a README and a file manifest (a descriptive list of the files and directories).

Note: See Also

You can read more about how to organize a project directory in the Planning for Success chapter of the Reproducible Research Guide.

6.1.2 Document the Computing Environment

A computing environment is a collection of hardware and software you use to run code. It’s important to document the computing environment where the code was written and tested, because it might not run correctly in other environments. R is designed to work seamlessly with a wide variety of hardware, so software, including R packages, is the primary concern.

You can view information about the computing environment in R with the built-in sessionInfo function:

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-conda-linux-gnu
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /home/nick/foundry/datalab/teaching/dwr/dwr_workshop_intermediate_r/.pixi/envs/default/lib/libopenblasp-r0.3.30.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.9   rmarkdown_2.30    knitr_1.51       
 [9] jsonlite_2.0.0    xfun_0.56         digest_0.6.39     rlang_1.1.7      
[13] evaluate_1.0.5   

You can document the computing environment manually, but it’s usually better to use an environment manager, a software tool that automates recording and recreating software environments. The environment manager we recommend and use at DataLab is Pixi, which has a command line interface. If you prefer to work entirely within R, the renv package is an environment manager exclusively for R packages (which is quite limiting!).

Note: See Also

You can read more about environment management in the Reproducible Computing chapter of the Reproducible Research Guide.

6.1.3 Consider Making a Package

Spend some time thinking about how the code in your project will be used. Will you need to run the code frequently? Will you need to run it as part of other projects or on other computers? Will other people need to run the code (would they benefit from being able to)? If the answer to some of these questions is yes, then it might be a good idea to turn the code into a package.

Developing a package can be a lot of work, especially if it’s your first time doing so. However, putting code in a package makes it substantially easier to distribute, install, and reuse.

Note: See Also

You can read more about how to create an R package in R Packages by Wickham and Bryan.

6.2 Tips for Code

This section is a buffet of ways to ensure the R code you write is clear, organized, and effective. As you read, think about which are relevant to your project(s).

6.2.1 Write Outlines

Before you start writing code to solve a problem, make an outline. The outline should describe the problem you’re trying to solve and the steps to solve it in natural language. You can write the outline as comments in an R script or as an entirely separate document. Make sure to write down descriptions of all necessary inputs and intended outputs. Draw pictures if they help to clarify the details of the problem.

Outlining is a critical step in programming and problem solving. Don’t skip it! The outline will serve as a roadmap as you write the code.

6.2.2 Notebooks versus Scripts

Notebooks, which usually have the extension .Rmd, are the main alternative to scripts. Scripts and notebooks are both great ways to store code, and for most projects, DataLab staff use a mix of both. To decide which one to use, think about why you’re writing code. Compared to notebooks, the specific advantages of scripts are that they are:

  • Simple. Scripts are text files and usually only contain one programming language (in our case, R). Notebooks require familiarity with Markdown, and some also require special software.

  • Easy to reuse. In R, you can run all of the code in a script with the source function. This makes scripts a great way to create a library of reusable functions. Writing reusable scripts is also the first step towards developing packages to share with a wider audience.

  • Easy to run in a command-line interface. R provides a command-line program, Rscript, for running R scripts. Although you probably use a graphical interface to run code on your computer, high-performance computing environments do not always have a graphical interface, and even when they do, the default is usually a command-line interface. Running notebooks at the command-line is more difficult.

Scripts also have one big disadvantage compared to notebooks: results must be saved and viewed after the script runs, which means you have to plan ahead and write some extra code. As a result, we recommend scripts for code you plan to reuse or run at the command line (especially if it will take a long time to run), and recommend notebooks for exploratory work.
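For instance, a script might end by explicitly saving its results to disk so they can be inspected after the run (the variable and file names here are just an illustration):

```r
# At the end of a script: save results so they survive after the script exits.
results = data.frame(step = 1:3, value = c(2.5, 3.1, 4.7))

# saveRDS writes a single R object; read it back later with readRDS.
saveRDS(results, "results.rds")
```

A companion script or interactive session can then call `readRDS("results.rds")` to pick up where the script left off.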

Note: See Also

You can read more about how to decide in the Reproducible Computing chapter of the Reproducible Research Guide.

6.2.3 Use Naming Conventions

Before you start writing code, choose conventions for naming variables, constants, functions, and other objects. If you’re working with a team, make sure the entire team is aware of and agrees with the conventions. Documenting the conventions in a style guide is often a good idea.

There’s no official style guide for R, but the Tidyverse style guide is likely the most popular unofficial style guide.

Note: See Also

See the Reproducible Computing chapter of the Reproducible Research Guide for more about how to name things in code.

6.2.4 Avoid setwd

When you set the working directory in a script, you typically have to make assumptions about where files and directories will be. These assumptions will not necessarily hold on other computers or for other users. A simple way to make your scripts more robust is to avoid the setwd function entirely when you write them. Treat setwd as a function that should only ever be used interactively (that is, typed into the R console).

Note: See Also

The here package provides an elegant solution for managing the working directory without using setwd or making assumptions about the layout of the file system.
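For example (a sketch; the path components are illustrative), here builds paths relative to the project root, so they work no matter where R was started:

```r
library("here")

# here() locates the project root (for example, by looking for an .Rproj
# file or a .git directory) and builds a path relative to it.
path = here("data", "example.csv")
```

Because the path is anchored to the project root rather than the working directory, the same code works for collaborators who keep the repository in a different location.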

6.2.5 Make Scripts source-friendly

You never know when some code you wrote might be useful in another context or might need to be adapted to fit into a larger pipeline. You can ensure your code is easy to reuse by writing all of your code in custom functions.

When you write a script, you’ll typically want it to do something when you run it. On the other hand, you might want to load the functions from a script for use elsewhere without actually running the script. The built-in source function loads and runs an R script.

You can design a script to do something when run from the command line (with the Rscript command) but not when loaded with source. This is the best of both worlds: you can use the script as a script or as library of functions. To do this, add code like this, called a guard, to your script:

if (sys.nframe() == 0) {
  # Code to run only when the script runs (e.g., with Rscript)
}
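Putting the pieces together, a source-friendly script might look like this (the file and function names are just an example):

```r
# greet.R -- works as a library (via source) and as a script (via Rscript).

# A reusable function, available after source("greet.R").
greet = function(name) {
  paste0("Hello, ", name, "!")
}

main = function() {
  message(greet("world"))
}

# Runs only when the script is executed directly, e.g. `Rscript greet.R`.
if (sys.nframe() == 0) {
  main()
}
```

Running `Rscript greet.R` prints the greeting, while `source("greet.R")` from another script or session loads `greet` and `main` without printing anything.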

6.2.6 Handle Repetitive Tasks with Iteration

R is a powerful tool for automating tasks that have repetitive steps. For example, you can:

  • Apply a transformation to an entire column of data.
  • Compute distances between all pairs from a set of points.
  • Read a large collection of files from disk in order to combine and analyze the data they contain.
  • Simulate how a system evolves over time from a specific set of starting parameters.
  • Scrape data from many pages of a website.

You can implement concise, efficient solutions for these kinds of tasks in R by using iteration, which means repeating a computation many times. R provides multiple strategies for writing iterative code:

  1. Vectorization, where a function is implicitly called on each element of a vector. See this section of DataLab’s R Basics reader for more details.
  2. Loops, including map and apply functions, where an expression is evaluated repeatedly until some condition is met. See this chapter of DataLab’s R Basics reader for more details.
  3. Recursion, where a function calls itself.

Vectorization is the most efficient and most concise iteration strategy, but also the least flexible, because it only works with vectorized functions and vectors. Loops, map, and apply functions are more flexible—they work with any function and any data structure with elements—but less efficient and less concise. Recursion tends to be the least efficient iteration strategy in R.
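To make the trade-offs concrete, here is the same small task (taking the square root of each element of a vector) written with each strategy; the recursive helper is our own illustration, not a standard function:

```r
x = c(1, 4, 9, 16)

# 1. Vectorization: sqrt is vectorized, so one call handles every element.
v1 = sqrt(x)

# 2. A for loop: more verbose, but works with any function.
v2 = numeric(length(x))
for (i in seq_along(x)) {
  v2[i] = sqrt(x[i])
}

# 2b. sapply, one of the apply functions, expresses the same loop concisely.
v3 = sapply(x, sqrt)

# 3. Recursion: the function calls itself on the rest of the vector.
#    This is usually the least efficient choice in R.
rec_sqrt = function(x) {
  if (length(x) == 0) return(numeric(0))
  c(sqrt(x[[1]]), rec_sqrt(x[-1]))
}
v4 = rec_sqrt(x)

# All four produce the same result: 1 2 3 4
```

When a vectorized function exists for your task, reach for it first; fall back to loops or apply functions when it doesn't.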

6.2.7 Print or Log Status Information

Printing out messages is a great way to provide feedback as your code runs, and can also help you find and fix bugs. R provides several different functions for printing out messages. Choose the one that’s most appropriate for your message:

  • message: This is the primary way to print informational messages in R. It’s part of R’s conditions system, which means other code can detect the message if necessary (for instance, to save it to a file).
  • warning: This function prints a warning message. Like message, it’s part of R’s conditions system. Use warnings when you want to warn the user about something, but want the code to keep running.
  • print: This function prints out a string representation of an R object, intended for programmers. You can use it to help you fix bugs by printing out variables at various points in your code. Use message instead if you just want to print a message to the user.
  • cat: This function, whose name stands for “concatenate and print,” is a low-level way to print output to the console or a file. The message function prints output by calling cat, but cat is not part of R’s conditions system. Use a different function unless you’re sure cat is necessary.
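To see the differences, try calls like these at the R prompt (the file count is made up for the example):

```r
n_files = 3

# print: a string representation of an object, useful while debugging.
print(n_files)

# message: an informational message for the user (sent to stderr).
message("Processing ", n_files, " files...")

# warning: the code keeps running, but the user is alerted.
if (n_files < 10) warning("Fewer than 10 files; results may be incomplete.")

# cat: low-level output; prefer message for user-facing text.
cat("n_files =", n_files, "\n")
```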

Logging means saving the output from some code to a file as the code runs. The file where the output is saved is called a log file or log, but this name isn’t indicative of a specific format (unlike, say, a “CSV file”).

It’s a good idea to set up some kind of logging for any code that takes more than a few minutes to run, because then if something goes wrong you can inspect the log to diagnose the problem. Think of any output that’s not logged as ephemeral: it could disappear if someone reboots the computer, or there’s a power outage, or some other, unforeseen event.

R’s built-in tools for logging are rudimentary, but members of the community have developed a variety of packages for logging. Here are a few that are still actively maintained as of April 2026:

  • logger – a relatively new package that aims to improve aspects of other logging packages that R users find confusing.
  • futile.logger – a popular, mature logging package based on Apache’s Log4j utility and on R idioms.

6.3 Case Study: BART Ridership

Bay Area Rapid Transit (BART) is a passenger heavy rail service that connects cities in the greater San Francisco Bay Area. They regularly publish data about ridership in Excel and CSV format. They provide multiple datasets with different sampling frequencies and covering different periods of time. Specifically:

  • Hourly entry-exit counts from 1 January 2018 to the present. This dataset is a single CSV file.
  • Daily exit counts from 1 January 1998 to the present. This dataset is a single Excel file.
  • Monthly entry-exit averages for weekdays, Saturdays, and Sundays, from 1 January 2001 to the present. This dataset consists of multiple Excel files: one for each month-year.

The entry-exit datasets break down statistics by the entry station and exit station, whereas the exit dataset breaks down statistics by the exit station only. Unfortunately, this means that to obtain monthly entry-exit counts before 2018, you must manually compute them from the monthly averages. In this case study, you’ll use R to download, read, and reshape the monthly entry-exit averages dataset for 2001 to 2008.

Important

Each time we teach this case study, the result is slightly different, so we don’t provide a write-up in this chapter. Instead, here are some of the scripts from past workshops:

6.4 Case Study: U.S. Fruit Prices

The U.S. Department of Agriculture (USDA) Economic Research Service (ERS) computed national average retail prices for various fruits and vegetables in 2013, 2016, and most recently, 2020. These data provide partial insight into how much Americans must spend to eat a healthy diet with a variety of fruits and vegetables. The USDA ERS warns against using the data to compare prices across years because of differences in how products were categorized, but it can still be used to examine consumer costs in a given year.

The 2020 data are provided as per-fruit Microsoft Excel (.xlsx) files or as an all-fruits comma-separated value (CSV) file ready to use with languages such as R. Unfortunately, the 2016 and 2013 data were only distributed as per-fruit files, and the structure of these files makes extracting the data non-trivial.

The goal of this case study is to develop an R script that can read the 2016 fruit data into a single all-fruit data frame. We’ll write the script in a way that makes it easy to extend to the 2016 vegetable data or the 2013 data.

6.4.1 Setting Up the Project

There are a few things you should do every time you start working on a new project. First, create a new directory to store files related to the project. This is usually called a repository or project directory. Give the repository a descriptive name such as usda_fruit_prices.

The purpose of the repository is to keep files related to the project together, so that they’re easy to find and share. Choosing a descriptive name for the repository, and for files in general, will make it easier for others and future you to understand your work. You can read more about how to organize projects and choose names in DataLab’s Reproducibility Principles and Practices reader.

Next, create a data/ subdirectory in the repository to store data. Subdirectories are a good way to keep repositories organized, especially if you follow a naming convention. For instance, DataLab projects almost always have a data/ subdirectory to store data.

Download the zipped 2016 fruit data and save it in the data/ subdirectory. Unzip the file with a zip program or R’s own unzip function. Although the 2016 data include CSV files, these have the same structural problems as the Excel files, so we’ll focus on the Excel files.

Finally, create an R/ subdirectory in the repository to store the project’s R scripts, and make a new R script in that subdirectory called extract_prices.R. You can create and open the script with RStudio or your favorite text editor.

6.4.2 Setting Up the Script

Start every script with a few comments about its purpose, usage, inputs, and outputs. These comments will serve as a blueprint for your code and as documentation. Write them before you write any code—this will help you make your ideas more concrete and keep you focused when you do start writing code.

For extract_prices.R, the goal is to extract price data from the Excel file for each fruit (or vegetable). We can extract the data to a data frame, since the data are already in a tabular format. We can have the script save the data frame as an RDS file, R’s native data format. If we want to do analysis on the data, we can use a separate script that reads the RDS file. This is a separation of concerns which ensures we can focus on the task at hand—data extraction. Summarize these ideas in comments at the beginning of extract_prices.R, by adding something like this:

# Extract a data frame of prices from USDA ERS fruit (and vegetable) price data
# Excel files. Save the data frame as an RDS file.

The key to making scripts reusable is to break computational tasks into small steps and write a short function for each step. When another script (or project) involves a similar step, you can reuse or repurpose some of the functions. This function-focused, modular development strategy also makes it easier to find and fix bugs, since you can test each function on a variety of inputs to make sure it works as intended. The strategy may also help you tackle large and complex tasks, since the component steps should be smaller and simpler, and solving one provides positive feedback. Knowing what it means for a step to be “small” takes experience, but it’s usually better to err on the side of too small rather than too large. As a rule of thumb, a function with more than one screen of code is probably trying to do too much.

Scripts you plan to run, like extract_prices.R, need some kind of starting point. Since we’re taking a function-focused approach, the starting point will be a function. There’s no standard or conventional name for a starting point function in R. Many other languages use a function called main as the starting point, so we’ll copy that convention. Add a function called main to extract_prices.R:

main = function() {
  # We'll fill this in soon!
}

Now let’s think about the steps we need to take to extract the price data. We want to extract the data from many different Excel files. If you haven’t already, open a few of the files in a spreadsheet program to take a look at the data. The first sheet of each file contains a table with the name of the fruit and prices broken down by how the fruit was prepared. The number of rows differs between fruits, but the overall structure doesn’t seem to change, which suggests we can read and process each file in the same way. There are two things you should do when you encounter a repetitive task like this one:

  1. Start by focusing on a single case. We’ll start by focusing exclusively on extracting the data for pears (pears 2016.xlsx).

  2. Once you get the code working on a single case, test that the code is general enough to handle other cases, and then use iteration to run the code on all cases. You’ll learn more about iteration later in this case study and in Section 6.2.6.

6.4.3 Reading the Pear Data

The first step to extracting the pear data is reading the pears 2016.xlsx Excel file into R. We’ll use the read_excel function from the readxl package to read the files, since R does not provide a built-in function to read Excel files. Use the R prompt to install the package:

install.packages("readxl")

Do not add this code to the extract_prices.R script. In most cases, you should not include code to install packages in your scripts. Instead, put library calls and comments near the beginning of the script, to specify which packages are required, and leave it to the user to install them. This prevents needless, time-consuming reinstallation of packages.
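If you want the script to fail early with a clearer message when a required package is missing, one optional pattern (a sketch, not something the script requires) is to check with requireNamespace before the library calls:

```r
# Optional: stop with an informative message if a required package is missing.
for (pkg in c("readxl")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required. Install it with ",
         "install.packages(\"", pkg, "\").")
  }
}
```

Plain library calls also produce an error for missing packages, so this is purely a nicety for users unfamiliar with R's error messages.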

We’re going to use the readxl package, so add a library call for the package near the beginning of extract_prices.R, just after the purpose comments and just before the main function:

# Extract a data frame of prices from USDA ERS fruit (and vegetable) price data
# Excel files. Save the data frame as an RDS file.

library("readxl")

# ...

We’ll add more calls to library here as we decide to use more packages. This way it’s easy for someone to see which packages are required as soon as they open the script. Putting the calls to library in alphabetical order by package name is also a good idea. You can read more about how to manage the packages a script or project requires in Section 6.1.2.

When developing a script, it’s often helpful to run the code, check what’s working, and make changes, repeating this process until the script is finished. Fortunately, R provides a convenient way to run an entire script: the source function. The first argument is the path to the script you want to run. In your R prompt, try sourcing extract_prices.R:

source("R/extract_prices.R")

The path is relative to R’s working directory; see this section of DataLab’s R Basics reader for details about how to change the working directory. The code above assumes the repository is the working directory. If the script was sourced correctly, R will load the readxl package and define a function main, which currently does nothing. As you add code to main and add other functions, you can source the script again to update these functions in your R session. If you’re using RStudio, you can also check the “source on save” box in the graphical user interface to make R automatically call source every time you save the script. The source function can save you time during development, but of course it’s also okay to run code by typing or pasting it into the prompt.

Take a look at the help file for the read_excel function (run ?read_excel in the R prompt). The first argument is the path to the Excel file to read. Change the code in the main function to read the pear data:

main = function() {
  read_excel("data/fruit/pears 2016.xlsx")

  # More to come...
}

Source the script and then call main in the R prompt:

main()
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
# A tibble: 13 × 7
   Pears—Average retail price per pound an…¹ ...2  ...3  ...4  ...5  ...6  ...7 
   <chr>                                     <chr> <chr> <chr> <chr> <chr> <chr>
 1 Form                                      Aver… <NA>  Prep… Size… <NA>  Aver…
 2 <NA>                                      <NA>  <NA>  yiel… cup … <NA>  per …
 3 Fresh1                                    1.51… per … 0.9   0.36… poun… 0.61…
 4 Canned                                    <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
 5 Packed in juice2                          1.78… per … 1     0.54… poun… 0.96…
 6 Packed in syrup or water3                 1.57… per … 0.65  0.44… poun… 1.06…
 7 1The USDA National Nutrient Database for… <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
 8 <NA>                                      <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
 9 2Consumers are assumed to eat the solid … <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
10 <NA>                                      <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
11 3The syrup (or water) is discarded prior… <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
12 <NA>                                      <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
13 Source: Calculated by USDA, Economic Res… <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
# ℹ abbreviated name:
#   ¹​`Pears—Average retail price per pound and per cup equivalent, 2016`

This returns the data in the Excel file, but there’s still a lot of cleanup to do. Since we’re eventually going to extract data about many different fruits, let’s start by getting the name of the fruit from the Excel file.

6.4.4 Getting the Fruit Name

Notice that the name of each fruit is in the top left cell (A1) of its file. The read_excel function treats the first row of cells as headers, so in the data frame, the name of the fruit is in the first column name. Create a function called get_fruit_name to get this column name:

get_fruit_name = function(sheet) {
  names(sheet)[[1]]
}

Also update the main function to save the data frame in a variable and call get_fruit_name:

main = function() {
  sheet = read_excel("data/fruit/pears 2016.xlsx")

  get_fruit_name(sheet)
  # More to come...
}

Once again, source the script and call main. The return value will be a string:

[1] "Pears—Average retail price per pound and per cup equivalent, 2016"

This has the name of the fruit, but also some extra text. To get just the name, we need to do some string processing. We’ll use the stringr package for string processing.

Install the package and add it to the library calls at the beginning of extract_prices.R:

# Extract a data frame of prices from USDA ERS fruit (and vegetable) price data
# Excel files. Save the data frame as an RDS file.

library("readxl")
library("stringr")

# ...

We’ll use several functions from stringr, but each will be explained in context. The package provides detailed documentation as well as a cheatsheet. If you want to learn more about string processing and stringr in general, see Chapter 2.

The name of the fruit is separated from the extra text by an em dash (the — character, which is longer than a hyphen-minus, the - character). We can use the str_split_fixed function to split the text into two parts at the em dash. The function’s first argument is the string to split, its second is the character on which to split (copy and paste the em dash), and its third is the number of parts. The function returns a matrix where each column corresponds to one of the split parts. In extract_prices.R, update get_fruit_name to split the text and get the first element, the fruit name:

get_fruit_name = function(sheet) {
  text = names(sheet)[[1]]
  str_split_fixed(text, "—", 2)[1, 1]
}

As usual, source the script and call main. The result is just the fruit name:

[1] "Pears"

We’ll use the fruit name later to label the data from each file.

6.4.5 Getting the Column Names

The column names on the sheet data frame come from the first row of the file, but the actual headers of the table are in the second and third rows of the file (rows 1 and 2 of the data frame):

# A tibble: 6 × 7
  Pears—Average retail price per pound and…¹ ...2  ...3  ...4  ...5  ...6  ...7 
  <chr>                                      <chr> <chr> <chr> <chr> <chr> <chr>
1 Form                                       Aver… <NA>  Prep… Size… <NA>  Aver…
2 <NA>                                       <NA>  <NA>  yiel… cup … <NA>  per …
3 Fresh1                                     1.51… per … 0.9   0.36… poun… 0.61…
4 Canned                                     <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
5 Packed in juice2                           1.78… per … 1     0.54… poun… 0.96…
6 Packed in syrup or water3                  1.57… per … 0.65  0.44… poun… 1.06…
# ℹ abbreviated name:
#   ¹​`Pears—Average retail price per pound and per cup equivalent, 2016`

Let’s create another function, get_column_names, to convert these rows into a vector we can use as column names. Add the function to extract_prices.R, with code to get rows 1 and 2 of the data frame:

get_column_names = function(sheet) {
  # Get rows 1 and 2, all columns.
  sheet[1:2, ]
}

As usual, update main to call the new function:

main = function() {
  sheet = read_excel("data/fruit/pears 2016.xlsx")

  fruit = get_fruit_name(sheet)
  get_column_names(sheet)
  # More to come...
}

Then source extract_prices.R and call main:

# A tibble: 2 × 7
  Pears—Average retail price per pound and…¹ ...2  ...3  ...4  ...5  ...6  ...7 
  <chr>                                      <chr> <chr> <chr> <chr> <chr> <chr>
1 Form                                       Aver… <NA>  Prep… Size… <NA>  Aver…
2 <NA>                                       <NA>  <NA>  yiel… cup … <NA>  per …
# ℹ abbreviated name:
#   ¹​`Pears—Average retail price per pound and per cup equivalent, 2016`

For each column, we need to paste the rows together. This “for each” is a hint that the task can be solved with iteration. Before proceeding, take a moment to learn more about iteration by reading through Section 6.2.6. For this task, the result should be one string per column, so you can use sapply to iterate over the columns.

By setting its collapse argument, you can use R’s built-in paste function to paste the elements of a vector together. The value of collapse is inserted between the elements. For the fruit price headers, there should be a space between the row text, so we’ll set collapse = " ". In extract_prices.R, update get_column_names, source the file, and call main:

get_column_names = function(sheet) {
  # Get rows 1 and 2, all columns.
  names = sheet[1:2, ]
  sapply(names, paste, collapse = " ")
}
Pears—Average retail price per pound and per cup equivalent, 2016 
                                                        "Form NA" 
                                                             ...2 
                                        "Average retail price NA" 
                                                             ...3 
                                                          "NA NA" 
                                                             ...4 
                                       "Preparation yield factor" 
                                                             ...5 
                                       "Size of a cup equivalent" 
                                                             ...6 
                                                          "NA NA" 
                                                             ...7 
                               "Average price per cup equivalent" 

The result is almost right, but the paste function converts all of the missing values NA to literal text "NA". To fix this, replace all of the missing values with the empty string "" before pasting the rows together. Let’s add a call to str_trim at the end of get_column_names to trim off any excess spaces as well. Once again, update the function in extract_prices.R, source the file, and call main:

get_column_names = function(sheet) {
  # Get rows 1 and 2, all columns.
  names = sheet[1:2, ]
  # Paste together, treating NAs as empty string.
  names[is.na(names)] = ""
  names = sapply(names, paste, collapse = " ")
  # Trim off extra whitespace.
  str_trim(names)
}
Pears—Average retail price per pound and per cup equivalent, 2016 
                                                           "Form" 
                                                             ...2 
                                           "Average retail price" 
                                                             ...3 
                                                               "" 
                                                             ...4 
                                       "Preparation yield factor" 
                                                             ...5 
                                       "Size of a cup equivalent" 
                                                             ...6 
                                                               "" 
                                                             ...7 
                               "Average price per cup equivalent" 

Notice that now two of the elements are just the empty string "". Take a look at the Excel file. Can you see why this happens? These are the two columns with units, so let’s call them units1 and units2. The numbers are there to make the column names unique. Let’s also make the other column names easier to use by converting them to lowercase with str_to_lower and replacing spaces with underscores (_) with str_replace_all. In extract_prices.R, update get_column_names, source the file, and call main:

get_column_names = function(sheet) {
  # Get rows 1 and 2, all columns.
  names = sheet[1:2, ]
  # Paste together, treating NAs as empty strings.
  names[is.na(names)] = ""
  names = sapply(names, paste, collapse = " ")
  # Trim off extra whitespace.
  names = str_trim(names)

  # Replace empty names with "unitsN"
  is_empty = names == ""
  n_empty = sum(is_empty)
  names[is_empty] = paste("units", seq_len(n_empty), sep = "")

  # Convert to lowercase and replace spaces with underscores.
  names = str_to_lower(names)
  str_replace_all(names, " +", "_")
}
Pears—Average retail price per pound and per cup equivalent, 2016 
                                                           "form" 
                                                             ...2 
                                           "average_retail_price" 
                                                             ...3 
                                                         "units1" 
                                                             ...4 
                                       "preparation_yield_factor" 
                                                             ...5 
                                       "size_of_a_cup_equivalent" 
                                                             ...6 
                                                         "units2" 
                                                             ...7 
                               "average_price_per_cup_equivalent" 

These names look good enough to use.
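
As an aside, the `" +"` pattern passed to str_replace_all above is a regular expression that matches a run of one or more spaces, so even an accidental double space collapses to a single underscore. A standalone sketch with a made-up header:

```r
library(stringr)

# " +" matches one or more consecutive spaces, so the double
# space below still becomes a single underscore.
str_replace_all(str_to_lower("Average  retail price"), " +", "_")
# "average_retail_price"
```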

6.4.6 Getting the Price Data

With the fruit name and column names taken care of, now it’s time to get the actual price data. There are a few different ways to identify the rows with price data. One way is to look for entries in the second column. Let’s set up a new function, get_fruit_prices. Eventually we’ll make this function return the price data, but to get started, let’s make it return the indexes of the first and last row with entries in the second column. Add this code to extract_prices.R:

get_fruit_prices = function(sheet) {
  is_filled = !is.na(sheet[, 2])
  range(which(is_filled))
}

Then update main, source the file, and call main:

main = function() {
  sheet = read_excel("data/fruit/pears 2016.xlsx")

  fruit = get_fruit_name(sheet)
  col_names = get_column_names(sheet)

  get_fruit_prices(sheet)
}
[1] 1 6

The first index corresponds to the header row, but we already dealt with the header in Section 6.4.5. Since the header spans two rows, we can skip it by adding 2 to the first index. The second index corresponds to the last row of prices, so it seems like this is an effective strategy for getting the price data (we can’t be completely sure until we test the strategy on other fruits).
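
The which and range combination deserves a closer look: which converts a logical vector into the integer positions of its TRUE elements, and range keeps just the smallest and largest of those. A toy example with a made-up column:

```r
# A made-up column: entries in positions 1, 3, and 6.
column = c("header", NA, "Fresh", NA, NA, "Frozen")
is_filled = !is.na(column)

which(is_filled)         # positions of the entries: 1 3 6
range(which(is_filled))  # first and last entry: 1 6
```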

Update get_fruit_prices and main in extract_prices.R to get the subset of rows with price data and to change the column names to the ones we computed earlier:

get_fruit_prices = function(sheet, col_names) {
  # Get all rows from first entry to last entry in column 2.
  is_filled = !is.na(sheet[, 2])
  idx = range(which(is_filled))
  prices = sheet[seq(idx[1] + 2, idx[2]), ]

  # Set column names.
  names(prices) = col_names

  prices
}


main = function() {
  sheet = read_excel("data/fruit/pears 2016.xlsx")

  fruit = get_fruit_name(sheet)
  col_names = get_column_names(sheet)

  get_fruit_prices(sheet, col_names)
}
# A tibble: 4 × 7
  form                      average_retail_price units1   preparation_yield_fa…¹
  <chr>                     <chr>                <chr>    <chr>                 
1 Fresh1                    1.5175920102         per pou… 0.9                   
2 Canned                    <NA>                 <NA>     <NA>                  
3 Packed in juice2          1.7876110202         per pou… 1                     
4 Packed in syrup or water3 1.5716448654999999   per pou… 0.65                  
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

The result looks good, but notice that the "Canned" row has several missing values. Take a look at the original Excel file to figure out why this happens. The "Canned" row is the beginning of a subgroup of prices for canned fruit. The data frame returned by get_fruit_prices will be more convenient for analysis if we exclude the "Canned" row and instead mark the following two rows as canned. For instance, we might label their form as "Canned, packed in juice" and "Canned, packed in syrup or water".
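
A handy trick for building subgroup labels is that cumsum applied to a logical vector counts how many TRUE values have been seen so far, because FALSE counts as 0 and TRUE counts as 1. With a made-up vector marking where subgroups start:

```r
# TRUE marks the start of a new subgroup.
starts = c(FALSE, FALSE, TRUE, FALSE, FALSE)

# Rows before the first TRUE get label 0; rows from the first
# TRUE onward get label 1; and so on.
cumsum(starts)
# 0 0 1 1 1
```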

One way to approach the subgroup labeling task is to compute the subgroup for each row, then use the split-apply pattern to process each subgroup. The "Canned" row doesn’t have an entry in the second column, so maybe we can use that to detect where each subgroup begins. You can use is.na to find those rows, and the cumulative sum function cumsum to make the subgroup labels (since FALSE counts as 0 and TRUE counts as 1). Update get_fruit_prices in extract_prices.R to split the price data by subgroup labels, source the file, and call main:

get_fruit_prices = function(sheet, col_names) {
  # Get all rows from first entry to last entry in column 2.
  is_filled = !is.na(sheet[, 2])
  idx = range(which(is_filled))
  prices = sheet[seq(idx[1] + 2, idx[2]), ]

  # Set column names.
  names(prices) = col_names

  # Compute form subgroups.
  subgroups = cumsum(is.na(prices[, 2]))
  split(prices, subgroups)
}
$`0`
# A tibble: 1 × 7
  form   average_retail_price units1    preparation_yield_factor
  <chr>  <chr>                <chr>     <chr>                   
1 Fresh1 1.5175920102         per pound 0.9                     
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

$`1`
# A tibble: 3 × 7
  form                      average_retail_price units1   preparation_yield_fa…¹
  <chr>                     <chr>                <chr>    <chr>                 
1 Canned                    <NA>                 <NA>     <NA>                  
2 Packed in juice2          1.7876110202         per pou… 1                     
3 Packed in syrup or water3 1.5716448654999999   per pou… 0.65                  
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

The result is a separate data frame for each subgroup. For each subgroup, we need to check whether the first row has an entry in the second column. If it doesn’t, then the first row is a group name (like "Canned") that should be pasted onto the other rows in the group. Add a function process_price_subgroup to extract_prices.R to do this for a single group:

process_price_subgroup = function(subgroup) {
  if (is.na(subgroup[1, 2])) {
    # First row is group name, so remove the row and paste the group name onto
    # the other rows.
    name = subgroup[[1, 1]]
    subgroup = subgroup[-1, ]
    form = str_to_lower(str_trim(subgroup$form))
    subgroup$form = paste(name, form, sep = ", ")
  }
  subgroup
}

You can use lapply to apply this function to each subgroup, and do.call with rbind to recombine the data frames. Update get_fruit_prices in extract_prices.R, source the file, and call main:

get_fruit_prices = function(sheet, col_names) {
  # Get all rows from first entry to last entry in column 2.
  is_filled = !is.na(sheet[, 2])
  idx = range(which(is_filled))
  prices = sheet[seq(idx[1] + 2, idx[2]), ]

  # Set column names.
  names(prices) = col_names

  # Compute form subgroups.
  subgroups = cumsum(is.na(prices[, 2]))
  by_subgroup = split(prices, subgroups)

  # Process each subgroup, then combine them into a single data frame.
  result = lapply(by_subgroup, process_price_subgroup)
  do.call(rbind, result)
}
# A tibble: 3 × 7
  form                        average_retail_price units1 preparation_yield_fa…¹
* <chr>                       <chr>                <chr>  <chr>                 
1 Fresh1                      1.5175920102         per p… 0.9                   
2 Canned, packed in juice2    1.7876110202         per p… 1                     
3 Canned, packed in syrup or… 1.5716448654999999   per p… 0.65                  
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

It seems like the script now works for a single fruit!
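
The split, lapply, do.call(rbind, ...) combination used above is a general pattern worth remembering: split a data frame into pieces, transform each piece, and recombine. A minimal round trip on a made-up data frame:

```r
# A made-up data frame with a grouping column.
df = data.frame(group = c("a", "a", "b"), value = 1:3)

# One data frame per group.
pieces = split(df, df$group)

# Transform each piece (here, scale the values).
pieces = lapply(pieces, function(piece) {
  piece$value = piece$value * 10
  piece
})

# Recombine into a single data frame; values become 10, 20, 30.
do.call(rbind, pieces)
```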

6.4.7 Testing the Script

Let’s try a different fruit, peaches, to make sure the script is sufficiently general. In extract_prices.R, change the file loaded in main, source the script, and call main:

main = function() {
  sheet = read_excel("data/fruit/peaches 2016.xlsx")

  fruit = get_fruit_name(sheet)
  col_names = get_column_names(sheet)

  get_fruit_prices(sheet, col_names)
}
# A tibble: 4 × 7
  form                        average_retail_price units1 preparation_yield_fa…¹
* <chr>                       <chr>                <chr>  <chr>                 
1 Fresh1                      1.6779315817         per p… 0.96                  
2 Canned, packed in juice2    1.8800453697999999   per p… 1                     
3 Canned, packed in syrup or… 1.5862877904999999   per p… 0.65                  
4 Canned, frozen              3.1870806846000002   per p… 1                     
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

Uh-oh, the last row is "Canned, frozen", which doesn’t seem correct. Take a look at the Excel file for peaches to see what went wrong. The bug is that we assumed subgroups are always separated by header rows where there’s no price (like the "Canned" row). In the peaches file, the row with prices for frozen peaches is not separated this way. Looking at a few more of the files, it seems like there are only a few subgroups: Canned, Dried, Fresh, Frozen, and Juice. So let’s change the script to check specifically for these.
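
Before changing the function, it helps to see how such a check behaves on its own. stringr’s str_starts tests whether each string begins with a pattern, and the | character in a regular expression separates alternatives (the inputs below are made up):

```r
library(stringr)

forms = c("Canned", "Frozen4", "Packed in juice2")

# Each string is checked against the start-of-string only, so
# "Packed in juice2" does not match even though it contains "juice".
str_starts(forms, "Canned|Dried|Fresh|Frozen|Juice")
# TRUE TRUE FALSE
```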

You can use stringr’s str_starts function to detect entries in the form column that start with one of the subgroup names. The second argument to the function is a regular expression (see Section 2.3), a string that describes a pattern. You can use the pipe character | in a regular expression to separate several different options. Update the code for get_fruit_prices in extract_prices.R, source the file, and call main:

get_fruit_prices = function(sheet, col_names) {
  # Get all rows from first entry to last entry in column 2.
  is_filled = !is.na(sheet[, 2])
  idx = range(which(is_filled))
  prices = sheet[seq(idx[1] + 2, idx[2]), ]

  # Set column names.
  names(prices) = col_names

  # Compute form subgroups.
  subgroup_names = "Canned|Dried|Fresh|Frozen|Juice"
  subgroups = cumsum(str_starts(prices$form, subgroup_names))
  by_subgroup = split(prices, subgroups)

  # Process each subgroup, then combine them into a single data frame.
  result = lapply(by_subgroup, process_price_subgroup)
  do.call(rbind, result)
}
# A tibble: 4 × 7
  form                        average_retail_price units1 preparation_yield_fa…¹
* <chr>                       <chr>                <chr>  <chr>                 
1 Fresh1                      1.6779315817         per p… 0.96                  
2 Canned, packed in juice2    1.8800453697999999   per p… 1                     
3 Canned, packed in syrup or… 1.5862877904999999   per p… 0.65                  
4 Frozen                      3.1870806846000002   per p… 1                     
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

This appears to be correct. Test the script on the pears data again as well, to make sure the changes didn’t introduce other bugs:

# A tibble: 3 × 7
  form                        average_retail_price units1 preparation_yield_fa…¹
* <chr>                       <chr>                <chr>  <chr>                 
1 Fresh1                      1.5175920102         per p… 0.9                   
2 Canned, packed in juice2    1.7876110202         per p… 1                     
3 Canned, packed in syrup or… 1.5716448654999999   per p… 0.65                  
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 3 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>

Some of the columns in the data frame have inappropriate data types. This can be fixed with just a few lines of code, but since it doesn’t provide any additional insight into how to write scripts, we won’t do so here. If you want to know more about how to fix “messy data” problems, start with Chapter 1.

6.4.8 Iterating Over All Fruit

The final step in extracting the fruit price data is to modify extract_prices.R to extract the prices of all fruits. This is also a good time to make several small changes to the script that will make it more convenient to use.

First, change the name of main to read_price_file, make the path to the Excel file a parameter called path, and limit the Excel sheet to seven columns:

read_price_file = function(path) {
  sheet = read_excel(path)
  sheet = sheet[, 1:7]

  fruit = get_fruit_name(sheet)
  col_names = get_column_names(sheet)

  prices = get_fruit_prices(sheet, col_names)
  prices$fruit = fruit
  prices
}

This change makes it possible to reuse the function for many different fruit (and vegetable) Excel files.
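
One detail worth knowing before the next step: the pattern argument of list.files is a regular expression, not a shell glob, so "\\.xlsx$" (a literal dot, then "xlsx" at the end of the name) is the reliable way to match Excel files. A self-contained demo in a temporary directory with made-up file names:

```r
# Create a temporary directory with two made-up files.
dir = tempfile("fruit_demo")
dir.create(dir)
file.create(file.path(dir, c("pears 2016.xlsx", "notes.txt")))

# Only the name ending in ".xlsx" matches the regular expression.
list.files(dir, pattern = "\\.xlsx$")
# "pears 2016.xlsx"
```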

Next, let’s create a new main function, which will call read_price_file on all of the Excel files in a user-specified directory. The function should have two parameters: input_dir, for the path to the directory that contains the Excel files, and output_path, for a path to an RDS file where the resulting data frame should be saved. You can use list.files to get a vector of paths to all files in the input directory, lapply to iterate over the files, and saveRDS to save the result. Here’s the code to add to extract_prices.R:

main = function(
  input_dir = "data/fruit/",
  output_path = "prices.rds"
) {
  # Get vector of all Excel files in the directory. The pattern
  # argument is a regular expression, so "\\.xlsx$" matches names
  # that end in ".xlsx".
  message("Input directory: ", input_dir)
  paths = list.files(input_dir, pattern = "\\.xlsx$", full.names = TRUE)

  # Read each file and combine the resulting data frames.
  result = lapply(paths, function(path) {
    message("Reading: ", path)
    read_price_file(path)
  })
  result = do.call(rbind, result)

  # Save the result to an RDS file.
  saveRDS(result, output_path)
  message("Wrote: ", output_path)

  result
}

The code for the function includes several calls to message to print informational messages. It’s a good idea to include these so that you (or other users) can verify that the paths to files are correct and can quickly diagnose any bugs you encounter. You can read more about various ways to print output in R in Section 6.2.7.
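
One property of message worth knowing: unlike print and cat, it writes to the standard error stream, so informational output never mixes into a script’s results, and callers can silence it with suppressMessages. A small sketch with a made-up function:

```r
report = function() {
  message("Reading a file...")  # goes to stderr, not stdout
  42                            # the actual return value
}

# The message is suppressed, but the result still comes through.
x = suppressMessages(report())
x
# 42
```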

Once again, test that the code works by sourcing the script and calling main:

# A tibble: 62 × 8
   form                       average_retail_price units1 preparation_yield_fa…¹
   <chr>                      <chr>                <chr>  <chr>                 
 1 Fresh1                     1.6155336441000001   per p… 0.9                   
 2 Applesauce2                1.0491006433000001   per p… 1                     
 3 Juice, ready to drink3     0.63113252779999995  per p… 1                     
 4 Frozen4                    0.51046574550000001  per p… 1                     
 5 Fresh1                     3.0871378169999999   per p… 0.93                  
 6 Canned, packed in juice2   1.4742760156000001   per p… 1                     
 7 Canned, packed in syrup, … 1.8552908088         per p… 0.65                  
 8 Dried4                     7.3309645323000003   per p… 1                     
 9 Fresh1                     0.54941729279999996  per p… 0.64                  
10 Frozen1                    3.63694066           per p… 1                     
# ℹ 52 more rows
# ℹ abbreviated name: ¹​preparation_yield_factor
# ℹ 4 more variables: size_of_a_cup_equivalent <chr>, units2 <chr>,
#   average_price_per_cup_equivalent <chr>, fruit <chr>

This looks almost perfect. There are still a few minor issues with data types, and also a small bug (related to frozen juice), but all of these can be fixed with small changes to the script.