The targets package allows you to separate your data analysis/monitoring code into smaller chunks that are then pieced together into a pipeline through established dependencies. Once the dependencies are defined, targets keeps track of which code chunks are up-to-date and which need to be re-run.
Before getting too into the weeds, let’s look at an example. We will use the iris dataset to build a basic plot and summary table using the targets framework.
We will use the iris
dataset to build a basic plot and
summary table using the targets
framework.
library(targets)
library(dplyr)
library(ggplot2)
library(stringr)
library(gtsummary)
Define a target using tar_target that we will call data_iris, which loads the iris dataset and sorts it by Sepal.Length. Most of the time, the R code we’d like to run within the command argument is too messy to include directly in tar_target, so we write it as a separate function and call that function within command.
data_load_iris <- function() {
iris %>%
arrange(Sepal.Length)
}
tar_target(name = data_iris,
command = data_load_iris())
Next, define a target that plots the data stored in data_iris. Note that here we put the R code directly within the target to demonstrate that it is equivalent, but it can be more difficult to read.
tar_target(name = fig_iris,
command = data_iris %>%
ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
geom_point()
)
Lastly, define a target that creates a table based on the data stored in data_iris.
tabulate_data_iris <- function(data_iris) {
data_iris %>%
tbl_summary(
by = Species,
statistic = all_continuous() ~ c("{mean} ({min}, {max})"),
label = list(
Sepal.Length = "Sepal Length",
Sepal.Width = "Sepal Width",
Petal.Length = "Petal Length",
Petal.Width = "Petal Width"
)
)
}
tar_target(name = tbl_iris,
command = tabulate_data_iris(data_iris)
)
Running tar_make() for the first time creates all 3 of our defined targets.
targets::tar_make()
## ▶ dispatched target data_iris
## ● completed target data_iris [0.32 seconds]
## ▶ dispatched target tbl_iris
## ● completed target tbl_iris [0.67 seconds]
## ▶ dispatched target fig_iris
## ● completed target fig_iris [0 seconds]
## ▶ completed pipeline [1.16 seconds]
We can load the targets into our environment with tar_load to view the contents of each target.
tar_load(c(data_iris, fig_iris, tbl_iris))
head(data_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.3 3.0 1.1 0.1 setosa
## 2 4.4 2.9 1.4 0.2 setosa
## 3 4.4 3.0 1.3 0.2 setosa
## 4 4.4 3.2 1.3 0.2 setosa
## 5 4.5 2.3 1.3 0.3 setosa
## 6 4.6 3.1 1.5 0.2 setosa
fig_iris
tbl_iris
You may be thinking: why not just create data_iris as an R object like normal?
data_iris2 <- iris %>%
arrange(Sepal.Length)
head(data_iris2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.3 3.0 1.1 0.1 setosa
## 2 4.4 2.9 1.4 0.2 setosa
## 3 4.4 3.0 1.3 0.2 setosa
## 4 4.4 3.2 1.3 0.2 setosa
## 5 4.5 2.3 1.3 0.3 setosa
## 6 4.6 3.1 1.5 0.2 setosa
In a simple example like this, sure, that’s fair. But even in a small project, I’d argue that targets has an advantage.
Let’s say that we want to modify our table. I decided to fix the strings for Species (“setosa”, “versicolor”, and “virginica”) so that the first letter of each word is capitalized. So I update tabulate_data_iris and re-run tar_make.
tabulate_data_iris <- function(data_iris) {
data_iris %>%
mutate(Species = str_to_title(Species)) %>%
tbl_summary(
by = Species,
statistic = all_continuous() ~ c("{mean} ({min}, {max})"),
label = list(
Sepal.Length = "Sepal Length",
Sepal.Width = "Sepal Width",
Petal.Length = "Petal Length",
Petal.Width = "Petal Width"
)
)
}
tar_make()
## ✔ skipped target data_iris
## ▶ dispatched target tbl_iris
## ● completed target tbl_iris [0.76 seconds]
## ✔ skipped target fig_iris
## ▶ completed pipeline [1.27 seconds]
Note that targets knew to skip running both data_iris and fig_iris again because nothing changed, but it re-ran tbl_iris like we wanted!
How does it know to do this? Let’s take a look at the dependency structure we’ve set up.
tar_visnetwork()
We can see that both fig_iris and tbl_iris rely on the dataset data_iris. tbl_iris also relies on the function tabulate_data_iris. In our previous example, we updated tabulate_data_iris, which is only used in tbl_iris; therefore, only the tbl_iris target needed to be re-run.
Let’s say, however, that we no longer want to use any iris data where the species is “versicolor”. We need to update our data target.
data_load_iris <- function() {
iris %>%
arrange(Sepal.Length) %>%
filter(Species != "versicolor")
}
Re-running the dependency map shows that now all 3 of our targets are outdated: we modified data_load_iris, which is used in data_iris, and both our figure and table depend on data_iris.
tar_visnetwork()
So, when we re-run tar_make…
tar_make()
## ▶ dispatched target data_iris
## ● completed target data_iris [0.33 seconds]
## ▶ dispatched target tbl_iris
## ● completed target tbl_iris [0.74 seconds]
## ▶ dispatched target fig_iris
## ● completed target fig_iris [0.01 seconds]
## ▶ completed pipeline [1.25 seconds]
We see all 3 of our targets are dispatched and completed as expected!
The main folder of this repo contains a pipeline in “_targets.R” similar to the one we worked through in this example. To try it yourself, clone this repo on your personal machine, run tar_make in your console to run the pipeline, and add to or modify the pipeline to see how targets reacts.
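For reference, a _targets.R file is an R script that ends with a list of target definitions. A minimal sketch for the example above might look like this (the sourced file paths are assumptions, and the repo’s actual _targets.R may differ):
library(targets)
source("packages.R")    # assumed: loads dplyr, ggplot2, stringr, and gtsummary
source("R/functions.R") # assumed: defines data_load_iris() and tabulate_data_iris()

list(
  tar_target(name = data_iris, command = data_load_iris()),
  tar_target(name = fig_iris,
             command = data_iris %>%
               ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
               geom_point()),
  tar_target(name = tbl_iris, command = tabulate_data_iris(data_iris))
)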
The main benefits of using targets to structure your pipeline are efficiency, consistency, and reproducibility.
Scenario 1: Let’s say you’re running a complicated analysis that requires a significant amount of run time each time it occurs. Without targets, you may choose to have all of your script in one program. It’s reproducible, sure, but what happens when you just want to make a slight change to the table at the very end of the document? You have to run the entire script again and wait for it to finish, only to find out there is an error at the very end!
Scenario 2: You then decide to save yourself time and split your analysis code into several parts so each can be debugged separately. You run the time-consuming analyses separately and save each result as an .rds file. In your other analysis programs, you read in the .rds files and make your tables and figures. Six months later, you get updated data. Now you need to manually rerun all of the analysis programs on your new data. Do you remember exactly which files need to be updated?
targets solves these problems by having you declare the dependencies up front with tar_target. targets will manage updating your out-of-date targets for you when you run tar_make.
When a target is run with tar_make, the results are stored in the “_targets” folder. This means I can simply use tar_load to bring the target into my environment even though I ran tar_make yesterday in a different instance of R. It’s a great time saver when coding a targets project interactively in R. However, remember that you may be working with outdated data; if you aren’t sure whether something has changed in a target, run tar_make first before loading targets into the environment.
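For example, in a fresh R session the stored results can be retrieved without recomputing anything, as long as the _targets/ store is present:
library(targets)
tar_load(tbl_iris)        # assigns tbl_iris into the global environment
fig <- tar_read(fig_iris) # returns the stored value without assigning it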
targets allows you to break up your pipeline into chunks (targets) that can be reused for multiple purposes, such as different reports. Take, for example, a target that contains a data frame of your randomized participant data with labeled factors for site IDs. You can reuse this target in other targets as a function input, which lets you merge the same formatted randomized data into each target you create. Now each time you report the site, it will use the same site label that you created in your randomized participant data target, and there is only one location to update code if a label changes or a new site is added. See leap-report’s data_randomized target and how it is reused in many downstream targets.
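As a hypothetical sketch of this pattern (the downstream target and function names here are invented for illustration):
list(
  tar_target(data_randomized, data_load_randomized()),              # shared, labeled data
  tar_target(tbl_enrollment, tbl_make_enrollment(data_randomized)), # reuses the site labels
  tar_target(fig_accrual, fig_plot_accrual(data_randomized))        # reuses the site labels
)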
When tar_make runs without error on your machine, everything in your pipeline should be reproducible on another computer. This is because targets runs the pipeline in a fresh R session, just like R Markdown. Therefore, nothing in your own environment could be affecting the success of tar_make.
Typically, if tar_make runs without error on one machine but not another, it is due to variation in package versions. See guideline-reproducibility for information on the renv package and how package management is essential to creating a reproducible pipeline on all machines.
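As a rough sketch, the typical renv workflow looks like this:
renv::init()      # create a project-local library and an renv.lock lockfile
renv::snapshot()  # record the package versions currently in use
renv::restore()   # on another machine, reinstall the recorded versions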
Each project will look slightly different depending on its purpose, but many projects will include these files/folders:
R/ - includes all R programs
reports/ - includes .Rmd or .qmd files that render reports, such as web reports or DSMB reports
_targets.R - this R file contains your targets pipeline
.gitignore - be sure to add _targets/ to the .gitignore so that you don’t accidentally push any data to GitHub. See the guideline-data-safety repo for more information about .gitignore
packages.R - an R program that loads all packages into the R session. This should be sourced in _targets.R; a minimal sketch follows below
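For the iris example above, packages.R could be as simple as:
# packages.R: one place to load every package the pipeline needs
library(dplyr)
library(ggplot2)
library(stringr)
library(gtsummary)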
For a larger example of what a reporting pipeline for a Data Coordinating Center could look like, culminating in a DSMB report, see leap-report.
Your project should set some guidelines for naming based on your anticipated pipeline. Below are suggestions for what the first word of a target should be and what that target should generally do (a sketch pipeline using these conventions follows the list).
data: these targets create a dataset to be used downstream.
result: these targets leverage a dataset to make some result.
tbl: these targets leverage a result to make a table object (e.g., a flextable).
fig: these targets leverage a result to make a figure (e.g., a ggplot2 object).
report: these targets create a report document.
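As a hypothetical sketch of a pipeline following these conventions (all names are invented):
list(
  tar_target(data_trial, data_load_trial()),                        # data: create a dataset
  tar_target(result_model, result_fit_model(data_trial)),           # result: analyze the dataset
  tar_target(tbl_model, tbl_make_model(result_model)),              # tbl: tabulate the result
  tar_target(fig_model, fig_plot_model(result_model)),              # fig: plot the result
  tar_target(report_dsmb, report_render_dsmb(tbl_model, fig_model)) # report: render a document
)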
Functions may be named with the convention noun_verb_detail(). For example:
data_load_participants(): data is the noun, load the verb, and participants indicates exactly what kind of data are being loaded.
labels_load(): labels is the noun, load the verb, and no details are given because the labels object is pretty general.
The primary inputs for functions should be targets. Non-target inputs should be ordered after target inputs and should have reasonable default values.
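For example, a hypothetical plotting function might look like this (the site column and default fill color are assumptions):
fig_plot_enrollment <- function(data_randomized, bar_fill = "steelblue") {
  # the target input (data_randomized) comes first;
  # the non-target option (bar_fill) follows with a reasonable default
  data_randomized %>%
    count(site) %>%
    ggplot(aes(x = site, y = n)) +
    geom_col(fill = bar_fill)
}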