Using R for many statistical tests

Published:

This post is in some senses a remix of a previous post about using purrr for analysis in R.

As part of a scientific project, I was asking whether the distribution of antibiotic use is important to levels of antibiotic resistance. Say you have two populations of people, each taking the same average, per capita number of antibiotics, but in one population, consumption is evenly spread, while in the other it is concentrated in a few people. Do we expect one to have more resistance than the other?

To answer this, we looked at antibiotic use and resistance across US states in 72 pathogen/antibiotic combinations (e.g., S. pneumoniae resistance to penicillin antibiotics). We partitioned antibiotic use into first use and repeat use. One person’s first use for an antibiotic a calendar year is either 0, if they did not take that antibiotic, or 1, if they took that antibiotic at least once. Repeat use is 0 if they never took the antibiotic, 0 if they took it once, 1 if they took it twice, 2 if they took it 3 times, etc.

The base, or “null” model, is that first and repeat use have the same association with resistance:

\[A_i = \mu + \beta_\mathrm{total} (F_i + R_i) + \varepsilon_i,\]

where $A_i$ is Antibiotic resistance in the $i$-th state, $\mu$ is the grand mean, $F_i$ is first use, $R_i$ is repeat use, and $\varepsilon$ is an error term.

The “alternative” model is that first and repeat use have different associations with resistance:

\[A_i = \mu + \beta_F F_i + \beta_R R_i + \varepsilon_i\]

We want to compare these two models, using a likelihood ratio test to see if the improved fit that comes from having two separate predictors ($F_i$ and $R_i$) is evidence that $\beta_F \neq \beta_R$, that is, that we should accept that we have reason to believe the alternative model better reflects the underlying reality than the base model.

First, I download the relevant data from the supplemental data for the paper:

library(tidyverse)
library(lmtest)

use <- read_tsv("https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMzk0MzUvZWxpZmUtMzk0MzUtZmlnMy1kYXRhMS12Mi50ZHM=/elife-39435-fig3-data1-v2.tds?_hash=KZsZkTZ14Oh8yKruvuxMT1Ja%2Fy0bojoI24mne8gvd9Q%3D")
res <- read_tsv("https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMzk0MzUvZWxpZmUtMzk0MzUtZmlnMy1kYXRhMi12Mi50ZHM=/elife-39435-fig3-data2-v2.tds?_hash=IMcMpKgsfLWKByV6%2B5zVYDoIMFgE2VNWV43ZR5xME8s%3D")

Then, I select only the relevant data (the main dataset used in the paper) and join the use and resistance data together:

data <- use %>%
  filter(source == "marketscan", data == "main") %>%
  inner_join(res, by = c("drug_group", "state_id")) %>%
  select(
    state_id, bug = pathogen, drug = drug_group,
    first_use, repeat_use, res = proportion_nonsusceptible
  ) %>%
  mutate(bug_drug = interaction(bug, drug))

The first way to check if $\beta_F \neq \beta_R$ is to run a global model, incorporating all 72 pathogen/antibiotic combinations at once:

null_model <- lm(res ~ bug_drug * I(first_use + repeat_use), data = data)
alt_model <- lm(res ~ bug_drug * (first_use + repeat_use), data = data)
lrtest(null_model, alt_model)

# Likelihood ratio test
#
# Model 1: res ~ bug_drug * I(first_use + repeat_use)
# Model 2: res ~ bug_drug * (first_use + repeat_use)
#   #Df LogLik Df  Chisq Pr(>Chisq)
# 1 145 2460.1
# 2 217 2482.1 72 44.025     0.9962

The result $p = 0.996$ tells you that the data appear to be much better explained by the simpler, “null” model with $\beta_F = \beta_R$, and therefore first and repeat use don’t have different impacts on resistance.

But is it the case that we have good evidence for some particular pathogen/antibiotic combinations? To run separate models on each combination, we can use tidyr’s nest function with purrr’s map functions:

results <- data %>%
  nest(-bug_drug) %>%
  mutate(
    null_model = map(data, ~ lm(res ~ I(first_use + repeat_use), data = .)),
    alt_model = map(data, ~ lm(res ~ first_use + repeat_use, data = .)),
    test = map2(null_model, alt_model, lrtest),
    p = map_dbl(test, ~ .$`Pr(>Chisq)`[2])
  )

The nest command makes a tibble (aka “data frame”) that has a column named data that is, itself, a bunch of tibbles:

data %>% nest(-bug_drug)

# A tibble: 72 x 2
#   bug_drug                  data
#   <fct>                     <list>
# 1 C. freundii.beta_lactam   <tibble [37 × 6]>
# 2 CoNS.beta_lactam          <tibble [42 × 6]>
# 3 E. cloacae.beta_lactam    <tibble [41 × 6]>
# 4 E. coli.beta_lactam       <tibble [44 × 6]>
# 5 E. faecalis.beta_lactam   <tibble [39 × 6]>
# 6 E. faecium.beta_lactam    <tibble [38 × 6]>
# 7 K. pneumoniae.beta_lactam <tibble [43 × 6]>
# 8 P. aeruginosa.beta_lactam <tibble [44 × 6]>
# 9 P. mirabilis.beta_lactam  <tibble [42 × 6]>
#10 S. aureus.beta_lactam     <tibble [43 × 6]>
# … with 62 more rows

For example, the first row, showing $\beta$-lactam resistance among C. freundii, has a data fields that is a tibble with 37 rows and 6 columns:

data %>% nest(-bug_drug) %>% pull(data) %>% first()

# A tibble: 37 x 6
#    state_id bug         drug        first_use repeat_use    res
#    <chr>    <chr>       <chr>           <dbl>      <dbl>  <dbl>
#  1 stateBFU C. freundii beta_lactam    0.106      0.0398 0.540
#  2 stateCCZ C. freundii beta_lactam    0.189      0.0885 0.483
#  3 stateCUX C. freundii beta_lactam    0.157      0.0705 0.2
#  4 stateCWD C. freundii beta_lactam    0.145      0.0609 0.17
#  5 stateCYL C. freundii beta_lactam    0.188      0.0837 0.509
#  6 stateELN C. freundii beta_lactam    0.0728     0.0260 0.0248
#  7 stateEPB C. freundii beta_lactam    0.177      0.0710 0.543
#  8 stateERZ C. freundii beta_lactam    0.144      0.0632 0.584
#  9 stateEVG C. freundii beta_lactam    0.137      0.0563 0
# 10 stateGKW C. freundii beta_lactam    0.162      0.0678 0.330
# … with 27 more rows

The map functions makes new columns that are linear regression objects. It’s worth noting the nifty feature in the formulas:

lm(res ~ I(first_use + repeat_use), data = .) means $A_i = \beta_\mathrm{total} (F_i + R_i) + \varepsilon_i$

r lm(res ~ first_use + repeat_use, data = .) means $A_i = \beta_F F_i + \beta_R R_i + \varepsilon_i$

Then we want to filter the rows based on the $p$-values:

results %>%
  count(p < 0.05, p.adjust(p, "BH") < 0.05)

# A tibble: 3 x 3
#   `p < 0.05` `p.adjust(p, "BH") < 0.05`     n
#   <lgl>      <lgl>                      <int>
# 1 FALSE      FALSE                         65
# 2 TRUE       FALSE                          6
# 3 TRUE       TRUE                           1

So in 7 pathogen/antibiotic combinations we have $p < 0.05$, but only 1 do we have statistical evidence after multiple hypothesis correction, and in this case, the results are not very compelling:

# get the use/resistance data for the significant bug/drug
sig_result <- results %>%
  filter(p.adjust(p, "BH") < 0.05) %>%
  as.list() %>%
  flatten()

sig_result$data %>%
  # "gather" the first and repeat use for plotting
  gather("use_type", "use", c("first_use", "repeat_use")) %>%
  ggplot(aes(use, res)) +
  # show the first/resistance and repeat/resistance relationships
  facet_wrap(~ use_type, scales = "free_x") +
  # show linear best first lines
  stat_smooth(method = "lm") +
  geom_point()

So when looking at first and repeat resistance on their own, we see no relationship with first use, and a potentially negative association with repeat use. The coefficients that come out of the alternative model are a bit crazy:

sig_result$null_model

# Call:
# lm(formula = res ~ I(first_use + repeat_use), data = .)
#
# Coefficients:
#               (Intercept)  I(first_use + repeat_use)
#                    0.2267                    -1.7958

sig_result$alt_model

# Call:
# lm(formula = res ~ first_use + repeat_use, data = .)
#
# Coefficients:
# (Intercept)    first_use   repeat_use
#      0.1637      38.7258     -86.7541

So while the null model gives a slope of $-1.8$, the alternative model gives enormous slopes: $+38$ and $-87$! It turns out that this is a problem of collinearity: first use and repeat use are highly correlated, which leads to weird predictions in linear models:

with(sig_result$data, cor.test(first_use, repeat_use))

#     Pearson's product-moment correlation
#
# data:  first_use and repeat_use
# t = 11.711, df = 33, p-value = 2.691e-13
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.8055830 0.9475536
# sample estimates:
#       cor
# 0.8978064

So we are left with the conclusion that we don’t have strong evidence to support the idea that $\beta_F \neq \beta_R$ and that the distribution of use is important, when sliced and diced in this way.