Simpson’s paradox has a pretty good explanation on Wikipedia, but Simpson’s original paper also has a great example.

He asks us to imagine an investigator who “wished to examine whether in a pack of cards the proportion of court cards (King, Queen, Knave) was associated with colour”. Imagine you didn’t know what was in a deck of cards: would you expect more royals among the black cards than among the red?

He suggests that we further imagine that “the pack which he examined was one with which Baby had been playing, and some of the cards were dirty.” (Who this capital-B Baby is is not described.) The investigator, being pretty ignorant about cards, records the royal-ness (i.e., face card or not), color, and dirtiness of each card.

He finds these results:

         Dirty            Clean
         Royal   Plain    Royal   Plain
Red        4       8        2      12
Black      3       5        3      15

Interestingly, black cards have a higher ratio of royal to plain cards both among the dirty cards (3/5 = 0.60 for black vs. 4/8 = 0.50 for red) and among the clean cards (3/15 = 0.20 vs. 2/12 ≈ 0.17), so within each group a black card is more likely to be royal than a red one. However, if you collapse the dirty and clean cards, you find that there are 6 royal cards and 20 plain cards for both black and red. This “provides what we would call the sensible answer, namely, that there is no such association”.
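
The arithmetic is easy to check with a short Python sketch. The counts are the eight from the table; the dictionary layout and the helper names `royal_odds` and `royal_share` are illustrative choices, not anything from Simpson’s paper. It reports both the royal-to-plain ratio and the proportion of royal cards, first within the dirty and clean groups and then for the collapsed deck.

```python
from fractions import Fraction

# The eight counts from the table: counts[condition][colour] = (royal, plain).
counts = {
    "dirty": {"red": (4, 8), "black": (3, 5)},
    "clean": {"red": (2, 12), "black": (3, 15)},
}


def royal_odds(royal, plain):
    """Ratio of royal to plain cards (e.g. 3/5 for dirty black cards)."""
    return Fraction(royal, plain)


def royal_share(royal, plain):
    """Proportion of royal cards among all cards in the cell."""
    return Fraction(royal, royal + plain)


def report(label, colour, royal, plain):
    odds = float(royal_odds(royal, plain))
    share = float(royal_share(royal, plain))
    print(f"{label:5} {colour:5} odds={odds:.3f} share={share:.3f}")


# Within each condition, compare the two colours.
for condition, by_colour in counts.items():
    for colour, (royal, plain) in by_colour.items():
        report(condition, colour, royal, plain)

# Collapse dirty and clean by summing the counts for each colour.
pooled = {
    colour: tuple(sum(counts[c][colour][i] for c in counts) for i in (0, 1))
    for colour in ("red", "black")
}
for colour, (royal, plain) in pooled.items():
    report("all", colour, royal, plain)
```

Within each group the black row comes out ahead on both measures, while the pooled rows are identical: 6 royal and 20 plain for each colour.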

Simpson then suggests that we change the labels to imagine that the investigator had actually done a medical experiment: dirty becomes “male”, clean becomes “female”, royal becomes “did not get therapy”, plain becomes “got therapy”, red becomes “got better”, and black becomes “stayed ill”. Now it’s less clear what the “sensible” answer is. Men who got the therapy were more likely to recover than men who did not, and the same was true of women, but, overall, the people who got the therapy were just as likely to recover as those who did not.
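
To see the relabeled version in numbers, here is a minimal Python sketch that applies the mapping above to the same eight counts and computes recovery rates. The `patients` layout and the `recovery_rate` helper are hypothetical names chosen for illustration.

```python
from fractions import Fraction

# Same counts, relabeled: patients[sex][group] = (got_better, stayed_ill).
# dirty -> male, clean -> female, plain -> treated, royal -> untreated,
# red -> got better, black -> stayed ill.
patients = {
    "male": {"treated": (8, 5), "untreated": (4, 3)},
    "female": {"treated": (12, 15), "untreated": (2, 3)},
}


def recovery_rate(got_better, stayed_ill):
    """Fraction of patients in the cell who got better."""
    return Fraction(got_better, got_better + stayed_ill)


# Recovery rate by treatment group within each sex.
for sex, groups in patients.items():
    for group, (better, ill) in groups.items():
        rate = recovery_rate(better, ill)
        print(f"{sex:6} {group:9} {rate} = {float(rate):.3f}")

# Recovery rate by treatment group with the sexes pooled.
overall = {}
for group in ("treated", "untreated"):
    better = sum(patients[sex][group][0] for sex in patients)
    ill = sum(patients[sex][group][1] for sex in patients)
    overall[group] = recovery_rate(better, ill)
    print(f"all    {group:9} {overall[group]} = {float(overall[group]):.3f}")
```

With these counts the treated group recovers at 8/13 versus 4/7 among men and 12/27 versus 2/5 among women, yet pooling gives 20/40 versus 6/12, which is one half in both cases.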

Simpson concludes that “[t]he treatment can hardly be rejected as valueless to the race when it is beneficial when applied to males and to females.” In other words, we find it reasonable to expect that men might have a different recovery rate from women when untreated, and men might have a different recovery rate from women when treated, and that the relationship between those rates within each sex could be different.

The tricky part is figuring out when this kind of division is sensible. It was sensible to divide up the participants in a medical experiment by sex, but it wasn’t sensible to divide up a census of playing cards by their smudginess.