# purrr for analysis in R

## 2018/12/11

In my postdoc work, I was running a lot of models on data. I found R really useful to doing the models, but I often struggled to write nice code around running many models. Until I discovered purrr.

As an example, say I wanted to do some analysis on the “iris” data set, which “gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris”, Iris setosa, I. versicolor, and I. virginica.

iris %>%
kable(caption = 'First records in the iris data')
Table 1: First records in the iris data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa

Let’s consider the relationship between sepal length and width in the three species.

iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
stat_smooth(method = 'lm') +
geom_point()

In base R it is relatively straightforward to extract the slope $$m$$ and intercept $$b$$ for a least-squares best fit to each species by specifying a model that includes interactions between species $$s$$ and slope: $y_i \sim m_{s_i} x_i + b_{s_i}$

lm(Sepal.Width ~ Species * Sepal.Length, data = iris) %>%
summary()
##
## Call:
## lm(formula = Sepal.Width ~ Species * Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.72394 -0.16327 -0.00289  0.16457  0.60954
##
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     -0.5694     0.5539  -1.028 0.305622
## Speciesversicolor                1.4416     0.7130   2.022 0.045056 *
## Speciesvirginica                 2.0157     0.6861   2.938 0.003848 **
## Sepal.Length                     0.7985     0.1104   7.235 2.55e-11 ***
## Speciesversicolor:Sepal.Length  -0.4788     0.1337  -3.582 0.000465 ***
## Speciesvirginica:Sepal.Length   -0.5666     0.1262  -4.490 1.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2723 on 144 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.6096
## F-statistic: 47.53 on 5 and 144 DF,  p-value: < 2.2e-16

In that regression, R treated I. setosa as the “base” species, so (Intercept) and Sepal.Length are the intercept and slope for I. setosa. The intercept for I. versicolor is the (Intercept) term plus the Speciesversicolor term, and the slope is the Sepal.Length term plus the Speciesversicolor:Sepal.Length term.

This is a bit of a mess if I just wanted to know what the slopes and intercepts were for each species. If I wanted three separate models, there’s a few ways to do it.

I’ll start with the ugly, naive way, which is to make three different data frame and three different models:

# these two lines would be repeated for I. versicolor and I. virginica
setosa_data = iris[iris$Species == 'setosa', ] setosa_model = lm(Sepal.Width ~ Sepal.Length, data = setosa_data) # the slope for I. setosa coef(setosa_model)['Sepal.Length'] ## Sepal.Length ## 0.7985283 This approach, although it works for the iris data, doesn’t scale well, since it requires a lot of manual typing of the species names. The second approach is to use some the base R functions split, which will break the data into a list of three separate data sets, and lapply, which will run a function on each member of that list: species_datasets = split(iris, iris$Species)
species_models = lapply(species_datasets, function(data) {
lm(Sepal.Width ~ Sepal.Length, data = data)
})

# get the slope from the first model, which is for I. setosa
coef(species_models[[1]])['Sepal.Length']
## Sepal.Length
##    0.7985283

This approach is still awkward because the data and the models are in separate objects. You need to manually keep track of them, make sure they are in the same order, etc. It would be a lot nicer if the data, the models, and whatever else you wanted to pull out from either of those were in one structure.

It turns out that data frame the columns in data frames can be lists, so we can make a vector in the data frame a list of objects:

data_frame(
species = levels(iris\$Species),
model = species_models
)
## # A tibble: 3 x 2
##   species    model
##   <chr>      <list>
## 1 setosa     <S3: lm>
## 2 versicolor <S3: lm>
## 3 virginica  <S3: lm>

(The “S3” in S3: lm means that the linear model object is an S3 object. R has multiple object-oriented programming systems, and S3 is one of them. It’s not that important.)

This is where purrr comes in. First, there is a function nest. Like split, it breaks a dataset down into smaller parts. Unlike split, it keeps them in a nice dataframe. It’s called “nest” because one of the columns in the data frame is a list of the smaller data frames.

# For this and future examples, it will be convenient to have the iris
# data as a tibble, rather than data.frame. It prints out nicer.
iris %<>% as_tibble

# "minus" means put all columns except Species into the nested dataframes
nest(iris, -Species)
## # A tibble: 3 x 2
##   Species    data
##   <fct>      <list>
## 1 setosa     <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3 virginica  <tibble [50 × 4]>

Next, there is a family of functions map. Just like map functions in other languages and the Map and lapply functions in base R, it takes a function and a vector of inputs, or multiple inputs, and returns the outputs.

iris %>%
nest(-Species) %>%
mutate(
# this works
model1 = lapply(data, function(x) lm(Sepal.Width ~ Sepal.Length, data = x)),
# "map" also works
model2 = map(data, function(x) lm(Sepal.Width ~ Sepal.Length, data = x)),
# "map" also allows a shorthand for anonymous functions
model3 = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .))
)
## # A tibble: 3 x 5
##   Species    data              model1   model2   model3
##   <fct>      <list>            <list>   <list>   <list>
## 1 setosa     <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>
## 2 versicolor <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>
## 3 virginica  <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>

The really nice bit about map is that is has close cousins like map_dbl which specify that the output should be a vector of numbers, rather than a list of numbers. Here’s what I mean:

iris %>%
nest(-Species) %>%
mutate(
model = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .)),
slope1 = map(model, ~ coef(.)['Sepal.Length']),
slope2 = map_dbl(model, ~ coef(.)['Sepal.Length'])
)
## # A tibble: 3 x 5
##   Species    data              model    slope1    slope2
##   <fct>      <list>            <list>   <list>     <dbl>
## 1 setosa     <tibble [50 × 4]> <S3: lm> <dbl [1]>  0.799
## 2 versicolor <tibble [50 × 4]> <S3: lm> <dbl [1]>  0.320
## 3 virginica  <tibble [50 × 4]> <S3: lm> <dbl [1]>  0.232

Note that the map for slope in the above gave a list of single numbers, while map_dbl gave a vector that “lays” more nicely in the data frame.

Now it’s nice and easy to compare the models in the context of a data frame:

iris %>%
nest(-Species) %>%
mutate(
model = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .)),
slope = map_dbl(model, ~ coef(.)['Sepal.Length']),
slope_confint = map(model, ~ confint(.)['Sepal.Length', ]),
slope_cil = map_dbl(slope_confint, first),
slope_ciu = map_dbl(slope_confint, last)
) %>%
ggplot(aes(x = Species, y = slope, ymin = slope_cil, ymax = slope_ciu)) +
geom_point() +
geom_errorbar()

Note that this final result only computed each model once, only computed the confindence intervals once, and managed to make the output figure without once creating an intermediate object!