knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
The point of this blog post is to walk through a pattern I’ve started using in some of my analyses that combines do.call(), purrr::pmap(), and some wrapper functions to customize how a given analysis gets run. I’ll start by demonstrating do.call() and pmap() separately, then show how you can use them together to do some cool things. I’m not going to go super in-depth on either do.call() or pmap(), so it might be worthwhile to look into the documentation for those functions separately.
Also – I’m going to use the {palmerpenguins} data here to illustrate this workflow. As is typically the case with toy data, the point here isn’t to run a suite of analyses that answer meaningful questions about this data, but rather to demonstrate how to combine these functions in a way that could help you answer meaningful questions for your own data.
With all of that said, onward and upward!
Setup
To start, let’s load the packages we’ll need.
library(tidyverse)
library(palmerpenguins)
Let’s also take a quick peeksie at the penguins data, although the content of the data isn’t terribly important here.
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Cool cool. Now, let’s assume we want to analyze this penguins data. Let’s say we want to estimate a mean and a correlation coefficient and fit a linear regression, and that this is our workflow (n.b. again that this probably shouldn’t be your actual workflow when you analyze data).
Let’s start with the mean body mass – this is easy for us.
mean(penguins$body_mass_g, na.rm = TRUE)
[1] 4201.754
Another way we can do the exact same thing is with do.call(). do.call() has a “what” argument, to which you provide the function you want to call (or the character string name of the function), and an “args” argument, which takes a list of arguments to pass to “what”. It has some other arguments, too, but I’m going to ignore those here. So, the call below does the exact same thing we did previously:
do.call(what = "mean", args = list(penguins$body_mass_g, na.rm = TRUE))
[1] 4201.754
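As a quick aside (this snippet is my addition, not part of the original workflow): because “what” accepts either a function or its name as a string, you can also pass the function object itself and get the same result.
#same as above, but passing the function object rather than its name
do.call(mean, args = list(penguins$body_mass_g, na.rm = TRUE))
Either spelling returns the same 4201.754 we got before; the string version is handy when the function name itself is data, which is exactly what we’ll exploit below.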
The nice thing about do.call() is that it’s very flexible. Say we wanted to run a correlation between body mass and bill depth. We can do this by directly calling the cor() function:
# option 1:
cor(
  x = penguins$body_mass_g,
  y = penguins$bill_depth_mm,
  use = "pairwise.complete.obs"
)
[1] -0.4719156
Or we can do the exact same thing via do.call():
# option 2
do.call("cor",
args = list(
x = penguins$body_mass_g,
y = penguins$bill_depth_mm,
use = "pairwise.complete.obs"
) )
[1] -0.4719156
Or say we wanted to run a linear regression with body mass regressed on bill depth and sex. Again, we can call lm() directly:
# option 1:
res1 <- lm(body_mass_g ~ bill_depth_mm + sex, data = penguins, na.action = "na.omit")

broom::glance(res1)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066. 5081.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Or via do.call():
#option 2
<- do.call("lm", args = list(
res2 formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
))
::glance(res2) broom
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066. 5081.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Combining with purrr::pmap()
Just based on the above, do.call() isn’t really doing anything useful for us. It’s just a slightly more verbose way to call a function. But where do.call() really shines is when you pair it with some iteration – which we’ll do now, via purrr::pmap() – and/or some conditional logic (which we’ll add later via a wrapper function). Basically, it shines when you program with it, is what I’m trying to say.
For those who don’t know, purrr::pmap() extends purrr::map() to allow for an arbitrary number of arguments to map over in parallel. If you’re not familiar with purrr::map(), Hadley’s R for Data Science book has a good chapter on it. But anyway, let’s illustrate pmap() by running a handful of correlations on some sample data.
#generate data
a <- rnorm(100)
b <- rnorm(100)
d <- rnorm(100)

#put data into a list
sample_args <- list(
  x = list(a, b, d),
  y = list(b, d, a)
)
This gives us a list of x and y values, where the first element of x is a, the first element of y is b, and so on. We can run a bunch of correlations – x[[1]] with y[[1]], x[[2]] with y[[2]], etc. – by using pmap() and cor():
pmap(sample_args, ~cor(..1, ..2, use = "pairwise.complete.obs"))
[[1]]
[1] 0.0467708
[[2]]
[1] 0.1479934
[[3]]
[1] -0.07458596
Which can be a helpful pattern.
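As a small aside (this variant is my addition, not something shown above): since each call to cor() here returns a single number, you could swap in purrr’s typed variant pmap_dbl() to get a plain numeric vector back instead of a list.
#same iteration, but simplifying the result to a numeric vector
pmap_dbl(sample_args, ~cor(..1, ..2, use = "pairwise.complete.obs"))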
What’s potentially more interesting, though, is that we can also use pmap() in conjunction with do.call() not only to iterate through arguments passed to a given function (like we do with cor() above), but also to iterate over various functions:
#create a vector of function names
<- c("mean", "cor", "lm")
funs
#create a list of function arguments, where each element of the list is a list of args
<- list(
fun_args list(penguins$body_mass_g, na.rm = TRUE),
list(
$body_mass_g,
penguins$bill_depth_mm,
penguinsuse = "pairwise.complete.obs"
),list(
formula = body_mass_g ~ bill_depth_mm + sex,
data = penguins,
na.action = "na.omit"
)
)
#combine the function names and args into a tibble
fun_iterator <- tibble(
  f = funs,
  fa = fun_args
)
#take a look at the tibble
glimpse(fun_iterator)
Rows: 3
Columns: 2
$ f <chr> "mean", "cor", "lm"
$ fa <list> [<3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3…
What we’re doing in the above code is:
- creating a list of function names;
- creating a list of function arguments (where each element of the list is a list of args);
- binding these lists together in a tibble.
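One small thing to watch (this check is my addition, not part of the original workflow): the whole pattern relies on funs and fun_args lining up element by element. tibble() should complain if the lengths don’t match, but an explicit guard makes the intent clearer when you’re specifying a lot of analyses ahead of time.
#optional sanity check: every function name needs a corresponding list of args
stopifnot(length(funs) == length(fun_args))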
Then, we can execute all of these functions with their corresponding arguments via do.call():
res <- pmap(fun_iterator, ~ do.call(..1, ..2))
Within the call to do.call(), we’re passing the first column of our fun_iterator tibble to the first argument of do.call() (as denoted by ..1), and the second column of the tibble to the second argument of do.call() (as denoted by ..2). This will give us a list, res, where each element is the result of the function/argument combination in our fun_iterator tibble.
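If the ..1/..2 shorthand feels opaque, here’s an equivalent way to spell the same call (my re-phrasing, not from the original post): pmap() matches the names of the list it iterates over (here, the tibble columns f and fa) to the argument names of the function you give it, so a named anonymous function reads a bit more explicitly.
#equivalent to the lambda version above, with the columns matched by name
res_named <- pmap(fun_iterator, function(f, fa) do.call(f, fa))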
To prove it worked, let’s look at the results:
#mean
res[[1]]
[1] 4201.754
#cor
res[[2]]
[1] -0.4719156
#lm
broom::glance(res[[3]])
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066. 5081.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In theory, you can specify an entire set of analyses ahead of time and then execute them using pmap() + do.call() if you wanted to. So let’s look at one way we might do that via a wrapper function.
Wrap Your Analyses
The real power of this is to write a function that wraps all of these components and allows you to run just a subset of them. And this is how I actually use this pattern in my own work. But I’ll touch on some real-world applications after we go through the code below.
Let’s start by writing a wrapper function that has one argument, include, where include is a character vector of function names.
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
  #some code here
}
Then let’s drop all of the code that we just ran into the function:
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
  #we already ran all of this
  funs <- c("mean", "cor", "lm")

  fun_args <- list(
    list(penguins$body_mass_g, na.rm = TRUE),
    list(
      penguins$body_mass_g,
      penguins$bill_depth_mm,
      use = "pairwise.complete.obs"
    ),
    list(
      formula = body_mass_g ~ bill_depth_mm + sex,
      data = penguins,
      na.action = "na.omit"
    )
  )

  fun_iterator <- tibble(
    f = funs,
    fa = fun_args
  )
}
And then we subset the fun_iterator tibble to only include the functions named in the include argument of our wrapper function, and execute only those functions via pmap() + do.call():
analyze_penguins <- function(include = c("mean", "cor", "lm")) {
  #this is all the same as previously
  funs <- c("mean", "cor", "lm")

  fun_args <- list(
    list(penguins$body_mass_g, na.rm = TRUE),
    list(
      penguins$body_mass_g,
      penguins$bill_depth_mm,
      use = "pairwise.complete.obs"
    ),
    list(
      formula = body_mass_g ~ bill_depth_mm + sex,
      data = penguins,
      na.action = "na.omit"
    )
  )

  fun_iterator <- tibble(
    f = funs,
    fa = fun_args
  )

  # filter to only a subset of these functions that we've asked for in the wrapper args
  fun_iterator <- fun_iterator[fun_iterator$f %in% include, ]

  #execute these functions
  pmap(fun_iterator, ~do.call(..1, ..2))
}
So, say we just wanted the mean:
analyze_penguins("mean")
[[1]]
[1] 4201.754
Or just the mean and the correlation:
analyze_penguins(c("mean", "cor"))
[[1]]
[1] 4201.754
[[2]]
[1] -0.4719156
Or just the linear model:
broom::glance(analyze_penguins("lm")[[1]])
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.642 0.640 483. 296. 2.34e-74 2 -2529. 5066. 5081.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
I really like this pattern for data cleaning. I have a handful of demographic variables that I regularly work with that need to be cleaned and/or recoded, and I have some helper functions I’ve written to clean/recode each of them individually. But I also have a “meta” recode_demographics() function that can execute any combination of my helper functions depending on what I need for a given project.
You can obviously also write your wrapper function to give you more control over the arguments to each constituent function (like by allowing you to pass a formula to lm(), for instance, rather than hardcoding your formula), which can make this whole approach very flexible – see the sketch below for what that might look like. It can be a bit time-consuming to write a wrapper that gives you the right level of flexibility, but if you have a set of related tasks you do frequently, I think it’s worth the time to figure out.
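For example, here’s a rough sketch of what passing in a formula could look like. Note that analyze_penguins_flex and its lm_formula argument are my own additions for illustration, not something from the original workflow.
analyze_penguins_flex <- function(include = c("mean", "cor", "lm"),
                                  lm_formula = body_mass_g ~ bill_depth_mm + sex) {
  funs <- c("mean", "cor", "lm")

  fun_args <- list(
    list(penguins$body_mass_g, na.rm = TRUE),
    list(
      penguins$body_mass_g,
      penguins$bill_depth_mm,
      use = "pairwise.complete.obs"
    ),
    list(
      formula = lm_formula, #now supplied by the caller instead of hardcoded
      data = penguins,
      na.action = "na.omit"
    )
  )

  fun_iterator <- tibble(f = funs, fa = fun_args)

  #keep only the requested analyses, then run them
  fun_iterator <- fun_iterator[fun_iterator$f %in% include, ]
  pmap(fun_iterator, ~do.call(..1, ..2))
}

#e.g., regress body mass on flipper length instead of the default formula
#analyze_penguins_flex("lm", lm_formula = body_mass_g ~ flipper_length_mm)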
Citation
@online{ekholm2022,
author = {Ekholm, Eric},
title = {Combining Pmap and Do.call},
date = {2022-03-15},
url = {https://www.ericekholm.com/posts/combining-pmap-and-docall},
langid = {en}
}