Additive and multiplicative risks for public health

Published:

I saw this Comment on a paper in The Lancet Global Health and thought is was a lovely example of why logistic regressions can be confusing.

The paper is a study of children with malnutrition and diarrhea. These two conditions are a vicious cycle: malnutrition weakens the immune system, which increases the risk of diarrhea caused by infections, which reduces children’s appetites and abilities to feed, leading to further malnutrition. This cycle is an important problem in global health.

In one of its analyses, the paper looks at the risk of death as a function of nutrition status (malnourished vs. better-nourished) and pathogen (infected with Cryptosporidium vs. uninfected or infected with some other pathogen). The paper raises the question: is Cryptosporidium, a parasite that causes diarrhea, more dangerous for malnourished children than better-nourished children? In other words, will malnourished children get a greater benefit from some public health intervention that reduces Cryptosporidium infections, compared to better-nourished children?

Here are some of the relevant data discussed in the Comment:

Nutrition Infection Deaths Total Crude risk
Malnourished Cryptosporidium 22 196 11.2%
  Other 70 875 8.0%
Better-nourished Cryptosporidium 11 686 1.6%
  Other 41 5697 0.7%

If you simply compare the crude risks among malnourished and among better-nourished children, it appears that Cryptosporidium is more problematic for malnourished children: the increase in the risk of death among malnourished children is 3.2% ($11.2 - 8.0$) compared to 0.9% ($1.6 - 0.7$) among the better-nourished children.

But these are not enormous effects, so I would want to do a regression to see whether these results actually support this conclusion. My first instinct is to do a logistic regression, predicting each child’s outcome (death or survival) from their nutrition status, infection status, and an interaction term. The interaction term will tell us whether being malnourished and having Cryptosporidium is worse than you would expect based on the two increases in risk from being malnourished and from having a Cryptosporidium infection.

First, I’ll put the data into R using the handy tribble function:

library(tidyverse)

count_data <- tribble(
  ~malnourished, ~crypto, ~deaths, ~total,
  TRUE, TRUE, 22, 196,
  TRUE, FALSE, 70, 875,
  FALSE, TRUE, 11, 686,
  FALSE, FALSE, 41, 5697
)

# # A tibble: 4 x 4
#   malnourished crypto deaths total
#   <lgl>        <lgl>   <dbl> <dbl>
# 1 TRUE         TRUE       22   196
# 2 TRUE         FALSE      70   875
# 3 FALSE        TRUE       11   686
# 4 FALSE        FALSE      41  5697

Now I want to take this count data and “uncount” it, so that each row represents one child:

data <- count_data %>%
  mutate(death = map2(deaths, total, ~ c(rep(1, .x), rep(0, .y - .x)))) %>%
  select(malnourished, crypto, death) %>%
  unnest()

# # A tibble: 7,454 x 3
#    malnourished crypto death
#    <lgl>        <lgl>  <dbl>
#  1 TRUE         TRUE       1
#  2 TRUE         TRUE       1
#  3 TRUE         TRUE       1
#  4 TRUE         TRUE       1
#  5 TRUE         TRUE       1
#  6 TRUE         TRUE       1
#  7 TRUE         TRUE       1
#  8 TRUE         TRUE       1
#  9 TRUE         TRUE       1
# 10 TRUE         TRUE       1
# # … with 7,444 more rows

(There’s a tidyverse uncount function, but it doesn’t do exactly what I want, since I have two pieces of information per row.)

I’ll quickly check that I did the transformation right by looking at the crude risks:

data %>%
  group_by(malnourished, crypto) %>%
  summarize(crude_risk = scales::percent(mean(death)))

# # A tibble: 4 x 3
# # Groups:   malnourished [2]
#   malnourished crypto crude_risk
#   <lgl>        <lgl>  <chr>
# 1 FALSE        FALSE  0.720%
# 2 FALSE        TRUE   1.60%
# 3 TRUE         FALSE  8.00%
# 4 TRUE         TRUE   11.2%

And then let’s run the regression:

summary(glm(formula = death ~ malnourished * crypto, family = "binomial", data = data))

# Call:
# glm(formula = death ~ malnourished * crypto, family = "binomial",
#     data = data)
#
# Coefficients:
#                             Estimate Std. Error z value Pr(>|z|)
# (Intercept)                  -4.9269     0.1567 -31.437   <2e-16 ***
# malnourishedTRUE              2.4846     0.2002  12.409   <2e-16 ***
# cryptoTRUE                    0.8101     0.3420   2.369   0.0178 *
# malnourishedTRUE:cryptoTRUE  -0.4357     0.4286  -1.017   0.3093
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, as expected, being malnourished increases your risk of death (that’s the 2.4 estimate, corresponding to an odds ratio of $e^{2.4} \approx 11$), and having a Cryptosporidium infection does too (estimate of 0.8 yields an odds ratio of $e^{0.8} \approx 2.2$).

If you were a skeptic, you might also not be surprised that the interaction term (malnourishedTRUE:cryptoTRUE) is not statistically significant ($p = 0.3$). But even if you were a skeptic, you might be surprised that the interaction term is negative, suggesting that Cryptosporidium infection presents less of an increase in risk for malnourished children when compared to better-nourished children.

And indeed this is true, depending on what your mean by “less” or “more” risk. The trick is that logistic regression works on a multiplicative, rather than additive scale, and the multiplicative increase in risk of death from having Cryptosporidium is less for malnourished children. For better-nourished children, the additive increase in risk was only 0.9 percentage points, but this was on a baseline risk of 0.7%: Cryptosporidium infection more than doubles their risk of death. For malnourished children, on the other hand, the additive increase in risk was 3.2 percentage points, which is certainly bigger than 0.9 percentage points, but it’s less than half of the baseline risk of 8.0%.

For public health, the additive risk can more useful, since it let’s you say: “Say I could completely eliminate Cryptosporidium infections either in a population of N malnourished children or in a population of N better-nourished children. Where could I save more lives?” This is more interesting than the multiplicative reasoning, which asks, “Where could I lead to the greatest fractional reduction in the number of deaths?” You’d save more lives by preventing infection among the malnourished children (assuming the observed effect is causal).

Although it’s problematic, I’ll show the results of a binomial-linear regression, where the risks add rather than multiply:

summary(glm(death ~ malnourished * crypto, family = binomial(link = "identity"), data = data))

# Call:
# glm(formula = death ~ malnourished * crypto, family = binomial(link = "identity"),
#     data = data)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -0.4880  -0.1202  -0.1202  -0.1202   3.1414
#
# Coefficients:
#                             Estimate Std. Error z value Pr(>|z|)
# (Intercept)                 0.007197   0.001120   6.426 1.31e-10 ***
# malnourishedTRUE            0.072803   0.009240   7.880 3.29e-15 ***
# cryptoTRUE                  0.008838   0.004925   1.795   0.0727 .
# malnourishedTRUE:cryptoTRUE 0.023407   0.024835   0.942   0.3459
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As expected, the interaction term is now positive (although still not significant).