In my postdoc work, I was running a lot of models on data. I found R really useful to doing the models, but I often struggled to write nice code around running many models. Until I discovered purrr.

As an example, say I wanted to do some analysis on the “iris” data set, which “gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris”, *Iris setosa*, *I. versicolor*, and *I. virginica*.

```
iris %>%
head(3) %>%
kable(caption = 'First records in the iris data')
```

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|

5.1 | 3.5 | 1.4 | 0.2 | setosa |

4.9 | 3.0 | 1.4 | 0.2 | setosa |

4.7 | 3.2 | 1.3 | 0.2 | setosa |

Let’s consider the relationship between sepal length and width in the three species.

```
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
stat_smooth(method = 'lm') +
geom_point()
```

In base R it is relatively straightforward to extract the slope \(m\) and intercept \(b\) for a least-squares best fit to each species by specifying a model that includes interactions between species \(s\) and slope: \[ y_i \sim m_{s_i} x_i + b_{s_i} \]

```
lm(Sepal.Width ~ Species * Sepal.Length, data = iris) %>%
summary()
```

```
##
## Call:
## lm(formula = Sepal.Width ~ Species * Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72394 -0.16327 -0.00289 0.16457 0.60954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5694 0.5539 -1.028 0.305622
## Speciesversicolor 1.4416 0.7130 2.022 0.045056 *
## Speciesvirginica 2.0157 0.6861 2.938 0.003848 **
## Sepal.Length 0.7985 0.1104 7.235 2.55e-11 ***
## Speciesversicolor:Sepal.Length -0.4788 0.1337 -3.582 0.000465 ***
## Speciesvirginica:Sepal.Length -0.5666 0.1262 -4.490 1.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2723 on 144 degrees of freedom
## Multiple R-squared: 0.6227, Adjusted R-squared: 0.6096
## F-statistic: 47.53 on 5 and 144 DF, p-value: < 2.2e-16
```

In that regression, R treated *I. setosa* as the “base” species, so `(Intercept)`

and `Sepal.Length`

are the intercept and slope for *I. setosa*. The intercept for *I. versicolor* is the `(Intercept)`

term plus the `Speciesversicolor`

term, and the slope is the `Sepal.Length`

term plus the `Speciesversicolor:Sepal.Length`

term.

This is a bit of a mess if I just wanted to know what the slopes and intercepts were for each species. If I wanted three separate models, there’s a few ways to do it.

I’ll start with the ugly, naive way, which is to make three different data frame and three different models:

```
# these two lines would be repeated for I. versicolor and I. virginica
setosa_data = iris[iris$Species == 'setosa', ]
setosa_model = lm(Sepal.Width ~ Sepal.Length, data = setosa_data)
# the slope for I. setosa
coef(setosa_model)['Sepal.Length']
```

```
## Sepal.Length
## 0.7985283
```

This approach, although it works for the iris data, doesn’t scale well, since it requires a lot of manual typing of the species names. The second approach is to use some the base R functions `split`

, which will break the data into a list of three separate data sets, and `lapply`

, which will run a function on each member of that list:

```
species_datasets = split(iris, iris$Species)
species_models = lapply(species_datasets, function(data) {
lm(Sepal.Width ~ Sepal.Length, data = data)
})
# get the slope from the first model, which is for I. setosa
coef(species_models[[1]])['Sepal.Length']
```

```
## Sepal.Length
## 0.7985283
```

This approach is still awkward because the data and the models are in separate objects. You need to manually keep track of them, make sure they are in the same order, etc. It would be a lot nicer if the data, the models, and whatever else you wanted to pull out from either of those were in one structure.

It turns out that data frame the columns in data frames can be lists, so we can make a vector in the data frame a list of objects:

```
data_frame(
species = levels(iris$Species),
model = species_models
)
```

```
## # A tibble: 3 x 2
## species model
## <chr> <list>
## 1 setosa <S3: lm>
## 2 versicolor <S3: lm>
## 3 virginica <S3: lm>
```

(The “S3” in `S3: lm`

means that the linear model object is an S3 object. R has multiple object-oriented programming systems, and S3 is one of them. It’s not that important.)

This is where `purrr`

comes in. First, there is a function `nest`

. Like `split`

, it breaks a dataset down into smaller parts. Unlike `split`

, it keeps them in a nice dataframe. It’s called “nest” because one of the columns in the data frame is a list of the smaller data frames.

```
# For this and future examples, it will be convenient to have the iris
# data as a tibble, rather than data.frame. It prints out nicer.
iris %<>% as_tibble
# "minus" means put all columns except Species into the nested dataframes
nest(iris, -Species)
```

```
## # A tibble: 3 x 2
## Species data
## <fct> <list>
## 1 setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3 virginica <tibble [50 × 4]>
```

Next, there is a family of functions `map`

. Just like map functions in other languages and the `Map`

and `lapply`

functions in base R, it takes a function and a vector of inputs, or multiple inputs, and returns the outputs.

```
iris %>%
nest(-Species) %>%
mutate(
# this works
model1 = lapply(data, function(x) lm(Sepal.Width ~ Sepal.Length, data = x)),
# "map" also works
model2 = map(data, function(x) lm(Sepal.Width ~ Sepal.Length, data = x)),
# "map" also allows a shorthand for anonymous functions
model3 = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .))
)
```

```
## # A tibble: 3 x 5
## Species data model1 model2 model3
## <fct> <list> <list> <list> <list>
## 1 setosa <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>
## 2 versicolor <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>
## 3 virginica <tibble [50 × 4]> <S3: lm> <S3: lm> <S3: lm>
```

The really nice bit about `map`

is that is has close cousins like `map_dbl`

which specify that the output should be a vector of numbers, rather than a list of numbers. Here’s what I mean:

```
iris %>%
nest(-Species) %>%
mutate(
model = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .)),
slope1 = map(model, ~ coef(.)['Sepal.Length']),
slope2 = map_dbl(model, ~ coef(.)['Sepal.Length'])
)
```

```
## # A tibble: 3 x 5
## Species data model slope1 slope2
## <fct> <list> <list> <list> <dbl>
## 1 setosa <tibble [50 × 4]> <S3: lm> <dbl [1]> 0.799
## 2 versicolor <tibble [50 × 4]> <S3: lm> <dbl [1]> 0.320
## 3 virginica <tibble [50 × 4]> <S3: lm> <dbl [1]> 0.232
```

Note that the `map`

for `slope`

in the above gave a list of single numbers, while `map_dbl`

gave a vector that “lays” more nicely in the data frame.

Now it’s nice and easy to compare the models in the context of a data frame:

```
iris %>%
nest(-Species) %>%
mutate(
model = map(data, ~ lm(Sepal.Width ~ Sepal.Length, data = .)),
slope = map_dbl(model, ~ coef(.)['Sepal.Length']),
slope_confint = map(model, ~ confint(.)['Sepal.Length', ]),
slope_cil = map_dbl(slope_confint, first),
slope_ciu = map_dbl(slope_confint, last)
) %>%
ggplot(aes(x = Species, y = slope, ymin = slope_cil, ymax = slope_ciu)) +
geom_point() +
geom_errorbar()
```

Note that this final result only computed each model once, only computed the confindence intervals once, and managed to make the output figure without once creating an intermediate object!