April 1, 2017 by scott

# History of statistics 3: Use and abuse of the normal

*This is the third of three posts drawn from Stigler’s* History of Statistics.

The last post ended with the Gauss-Laplace synthesis, in which it was realized that the normal, or Gaussian, distribution is a very special one: not only is the sum of many independent random variables approximately normally distributed, the normal is also closely tied up with the simple, arithmetic mean and the method of least squares.

As I mentioned last time, although the roots of the mathematics of statistics are in the physical sciences (astronomy and geodesy), there was a lot of interest in answering what we would now call statistical questions in the social sciences. I already mentioned the question of the sex ratio at birth: more babies born are male than female, but the effect is small enough that a careful thinker would need the math of the binomial distribution to be able to tell if it was real or not. (The "surplus women" panic came later, in the 1850s. In that case, people were alarmed that there were more women than men in England. Because the economic and social role of women was seen entirely in the frame of marriage, this surplus meant that there was a demographically certain, looming cohort of people who were going to be poor and sad. The interest in the birth ratio precedes this panic.)

A very practical social question, the census, engaged (among other people) Laplace and a younger mathematician, Adolphe Quetelet. Before statistical thinking, the only way to get a good census was to count as many people as you could. In a sense, this is still true: the way to get the most accurate census is to count everyone, which is effectively impossible. Laplace had suggested a method that we would now call random sampling: the French kept very accurate records of numbers of births, but knowledge of the ratio of living population to number of births was sketchier. Laplace proposed to randomly sample communities from across France, carefully measure the population-to-births ratio in each one, and use those few communities to estimate the overall population-to-births ratio. The population-to-births ratio times the number of births equals the population.
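To make the arithmetic concrete, here is a minimal sketch of that kind of ratio estimate. The country is entirely made up (the community sizes, the range of population-to-births ratios, and the sample of 30 are my own assumptions, not Laplace's or Quetelet's figures), and `estimate_population` is just a name I picked for illustration.

```python
import random

def estimate_population(total_births, sampled_communities):
    """Ratio estimate: (sampled population / sampled births) * total births.

    sampled_communities is a list of (population, births) pairs.
    """
    sample_pop = sum(pop for pop, _ in sampled_communities)
    sample_births = sum(births for _, births in sampled_communities)
    ratio = sample_pop / sample_births
    return ratio * total_births

# A hypothetical country of 2,000 communities with invented sizes and
# invented population-to-births ratios somewhere between 25 and 31.
rng = random.Random(0)
communities = []
for _ in range(2000):
    pop = rng.randint(500, 20000)
    births = max(1, round(pop / rng.uniform(25, 31)))
    communities.append((pop, births))

# Births are assumed to be known exactly for the whole country.
total_births = sum(births for _, births in communities)
true_population = sum(pop for pop, _ in communities)

# Carefully count only 30 randomly chosen communities.
sample = rng.sample(communities, 30)
estimate = estimate_population(total_births, sample)
relative_error = abs(estimate - true_population) / true_population

print(f"true population: {true_population}")
print(f"estimate:        {estimate:.0f}")
print(f"relative error:  {relative_error:.2%}")
```

The sketch only works this well because the invented ratios do not vary much from one community to the next, which is exactly the assumption Keverberg goes after.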

Quetelet, who was Belgian, started to use "Laplace’s method" in the Low Countries. One critic of this approach was Baron de Keverberg (who is obscure enough not to appear in the English Wikipedia). Keverberg wrote in a letter to Quetelet that a random sample could never hope to account for all the variation in the true population:

> The law regulating mortality is composed of a large number of elements: it is different for towns and for the flatlands, for large opulent cities and for smaller and less rich villages, and depending on whether the locality is dense or sparsely populated. This law depends on the terrain (raised or depressed), on the soil (dry or marshy), on the distance to the sea (near or far), on the comfort or distress of the people, on their diet, dress, and general manner of life, and on a multitude of local circumstances that would elude any a priori enumeration. […] It is then doubtful that we will often find population regions which, in this regard, can be assimilated the one with the other, and combined in the same category. If such a division of the kingdom could be accomplished on an approximately exact basis, it is likely that it would consist of such a large number of parts that there would be little advantage in terms of work saved.

In other words, the different parts of the country are so different that you could never enumerate all the "types" of communities, and even if you could you would never find a representative sample of that true diversity among the actual communities, and even if you could then you would be doing so much work anyway that you might as well do an actual, full census.

Quetelet took this argument to heart and spent some time showing that there actually was a great diversity in the birth rates, that birth rate seemed to vary with average temperature, and so on. He was also interested in criminal convictions: how did conviction rate vary with sex, age, literacy, and education? (In fact, Poisson and Laplace continued a long line of contentious research that treated jurors’ decisions as Bernoulli trials. Naturally people didn’t like the idea that something as weighty as the legal process could be considered a random variable.)

When Quetelet was exposed to Laplace’s central limit theorem, which states that a sum of many independent random variables converges to a normal distribution, he committed a logical fallacy: affirming the consequent. He reasoned that, if data *did* follow a normal distribution, then those data must have been drawn from a homogeneous population. This idea, had it been true, would have been the perfect counter to Keverberg: if you lump some set of communities together (say, raised terrain and depressed terrain) and you find that the birth rates from those communities are normally distributed, then you are justified in having ignored the difference in terrain and treated those communities as a homogeneous population.

In 1965, Shapiro and Wilk developed a formal test to determine if data were normally distributed. Quetelet had no such machinery, so instead he compared the empirical distribution of his data against the theoretical Gaussian distribution. For example, say you have *N* people whose mean height is \(\mu\). Quetelet would compare how many people actually had heights between \(\mu\) and \(\mu + \Delta\) with how many the model predicted, then do the same for \(\mu + \Delta\) to \(\mu + 2\Delta\), and so on. Quetelet would then inspect the two sets of values by eye and decide if the match was good enough. This was not a very sensitive test, and Quetelet found normal distributions everywhere.
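Here is a rough sketch of what that bin-by-bin comparison looks like, using only the Python standard library. The heights are simulated (invented mean of 170 cm and spread of 7 cm), and fitting the Gaussian with the sample mean and sample standard deviation is my assumption about the procedure, not a reconstruction of Quetelet's actual calculations.

```python
import random
from statistics import NormalDist, mean, stdev

def binned_comparison(data, delta, n_bins):
    """Observed vs. expected counts in bins of width delta around the mean.

    Bins run from mu - n_bins*delta up to mu + n_bins*delta.
    Returns a list of (low, high, observed, expected) tuples.
    """
    mu, sigma = mean(data), stdev(data)
    model = NormalDist(mu, sigma)  # the Gaussian fitted to the data
    rows = []
    for k in range(-n_bins, n_bins):
        lo, hi = mu + k * delta, mu + (k + 1) * delta
        observed = sum(lo <= x < hi for x in data)
        expected = len(data) * (model.cdf(hi) - model.cdf(lo))
        rows.append((lo, hi, observed, expected))
    return rows

# Simulated heights in cm; the parameters are made up for illustration.
rng = random.Random(1)
heights = [rng.gauss(170.0, 7.0) for _ in range(5000)]

rows = binned_comparison(heights, delta=5.0, n_bins=4)
for lo, hi, observed, expected in rows:
    print(f"{lo:6.1f} to {hi:6.1f}: observed {observed:5d}, expected {expected:7.1f}")
```

Deciding whether the two columns are "close enough" is left to the reader's eye, which is the weak point: with wide bins, plenty of non-normal data would pass too.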

It took about 25 years for someone to figure out a convincing way to show that Quetelet’s assumption (that normality implies homogeneity) was incorrect. That "proof" came from Francis Galton, famous for instigating the first regression analyses. Galton started playing with a machine, the quincunx or "Galton box", and his argument was based on what he saw. The Galton box is like a Plinko or pachinko board: balls are dropped in at the top, and the balls bounce over pegs distributed throughout the board. Because each peg shunts each ball to the left or right mostly at random, the balls settle at the bottom in the shape of an approximately normal distribution.
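The board is easy to simulate. In this minimal sketch I assume an idealized board where every peg sends the ball right with probability exactly 1/2, independently of everything else; the number of balls and rows is arbitrary.

```python
import random
from collections import Counter

def galton_box(n_balls, n_rows, rng):
    """Drop n_balls through n_rows of pegs.

    Each ball's final bin is the number of times it bounced right,
    so bins are numbered 0 through n_rows.
    """
    return Counter(
        sum(rng.random() < 0.5 for _ in range(n_rows))
        for _ in range(n_balls)
    )

rng = random.Random(2)
bins = galton_box(n_balls=2000, n_rows=12, rng=rng)

# Crude text histogram: one '#' per 10 balls in each bin.
for k in range(13):
    print(f"{k:2d} {'#' * (bins[k] // 10)}")
```

The counts are binomial, and with enough rows the histogram takes on the familiar bell shape.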

Galton noticed, however, that if you stopped the balls in the middle of the board, you would also get a normal distribution, and if you dropped just a slice of those intermediate balls to the bottom, you would get a little normal distribution at the bottom centered around that slice. But, if you drop *all* the intermediate balls, it’s as if you didn’t put the middle bar in at all, and you get the original distribution back. Thus, a normal distribution can be built by adding together many smaller normal distributions, each centered somewhere different!
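Galton's two-stage experiment can be sketched the same way. Again I assume idealized 50/50 pegs; the row counts, the number of balls, and the choice of slice 6 are arbitrary picks of mine.

```python
import random
from collections import Counter

def bounce(start, n_rows, rng):
    """Final position of a ball that starts at `start` and passes n_rows pegs."""
    return start + sum(rng.random() < 0.5 for _ in range(n_rows))

rng = random.Random(3)
n_balls, top_rows, bottom_rows = 20000, 8, 8

# Stage 1: stop every ball on a shelf halfway down the board.
middle = [bounce(0, top_rows, rng) for _ in range(n_balls)]

# Release only the balls sitting in slice 6: a small bell centered
# bottom_rows / 2 bins to the right of the slice.
slice_final = Counter(bounce(m, bottom_rows, rng) for m in middle if m == 6)
slice_mean = sum(k * c for k, c in slice_final.items()) / sum(slice_final.values())

# Release every ball from the shelf.
all_final = Counter(bounce(m, bottom_rows, rng) for m in middle)

# Compare with a fresh set of balls dropped through a board with no shelf.
direct = Counter(bounce(0, top_rows + bottom_rows, rng) for _ in range(n_balls))

# Largest difference in the fraction of balls per bin between the two boards.
max_gap = max(
    abs(all_final[k] - direct[k]) / n_balls
    for k in range(top_rows + bottom_rows + 1)
)

print(f"mean final position of slice 6: {slice_mean:.2f}")
print(f"largest per-bin gap, shelf vs. no shelf: {max_gap:.4f}")
```

Each slice is its own little population with its own center, yet pooling all of them gives a bell that is indistinguishable from the one made by balls that never stopped.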

I think this turn of events is so interesting because, while invalidating Quetelet’s argument, it shows that you don’t necessarily need a homogeneous population to get a well-behaved distribution of results. In other words, Keverberg might be right that there are many influences, but these many influences might cancel one another out in a way that validates ignoring them.

Also, having a physical science training and working in the biological sciences, I often feel like I’m living in Keverberg Land. Keverberg and Quetelet didn’t have any idea about the magnitude of the effects from the influences that they were talking about, and, worse yet, they didn’t have any mathematically sound way to reason about analyses of variance or to assess combinations of factors. In biology, I feel like we’re often stuck at the first level: it’s entirely unclear what the effect size will be from some changing variable. For example, it can be very hard to predict how biological results will change if you switch from male to female test subjects (or vice versa) or switch from animal to human subjects. Quetelet spent a lot of time and made a lot of plots and tables examining how different factors could affect different outcomes (like birth rates or juries’ decisions), and it didn’t really make a huge difference in the end.
