
Estimators

Parameters, populations, and samples

Some people refer to statistics (the functions) as sample statistics to emphasize that they are functions of a sample of data that was drawn from some larger population, which has some fixed but unknowable parameters that describe it. For example, if you draw many data points \(x_i\) from a distribution and compute the mean of the drawn data points, you do not expect that the sample mean that you compute will be exactly equal to the true population mean.

In mathematical terms, we say that a random variable \(X\) has some expected value \(\mathbb{E}[X]\) that is fixed. A random variable is a function, and the expected value is another function that links the random variable with a single number. There is nothing "random" about this so far. The analog of the random sample is a set of iid random variables \(X_i\), and the analog of the sample mean is an estimator \(t(X)\), a function of those random variables.

Ideally, the estimator \(t(X)\) will actually reflect the parameter \(\mathbb{E}[X]\) that we wanted to learn about. The rest of this section will discuss what makes a "good" estimator.

A useful example: the "German tank" problem

To examine the properties of estimators and what makes an estimator a "good" one, I will use an example that is mathematically tractable but just unfamiliar enough to make you think.

During World War II, it was important to the Allies to know how many tanks Germany was producing. The traditional approach was to use spies and aerial reconnaissance. The new approach was to use statistics on serial numbers, which turned out to be much more accurate.

A serial number is a unique number written on a manufactured part. Serial numbers are usually assigned in sequential order, so that older parts have lower serial numbers and newer parts have higher numbers. German tanks had serial numbers. When the Allies captured German tanks, they took note of those numbers, which gave them a clue about how many tanks there were. For example, if you captured three tanks and found serial numbers 1, 3, and 5, you know there are at least 5 tanks total, and there probably aren't more than 10 or so. If you find serial numbers 100, 300, and 500, then you know there are at least 500 tanks, and there are probably more like 1,000.

The Allies had a fairly complex problem, because they wanted to estimate the rate of production, and they had many serial numbers on many different parts of the tank, some of which were not exactly sequential. Let's instead consider an abstracted, simplified version of this problem. Say you drew \(n\) numbers from a uniform distribution ranging from \(A\) to some unknown upper limit \(B\). You want to estimate \(B\), which is analogous to the number of German tanks out there.

Mathematically, we are interested in iid random variables \(X_i\) that follow the uniform distribution:

\[ \begin{gathered} f_X(x) = \frac{1}{B-A} \text{ for } A \leq x \leq B \\ F_X(x) = \frac{x-A}{B-A} \text{ for } A \leq x \leq B \end{gathered} \]

For simplicity, let's say that \(A=0\), so that \(f_X(x) = 1/B\) and \(F_X(x) = x/B\) for \(0 \leq x \leq B\).

Our challenge is to create a statistic, a function of the data, that estimates the parameter \(B\) and then examine the mathematical properties of the corresponding estimator. It's typical to denote estimators with a "hat", so we will write our estimators like \(\hat{B}\). It's also typical to denote statistics with lower-case letters reminiscent of the true value, so we'll use names like \(b\). To be clear, \(b\) is a function of numbers, although we also write \(\hat{B} = b(X)\), where now "\(b\)" means some manipulation of a set of iid random variables to create a new random variable.
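To make this concrete, here is a minimal simulation sketch in Python (assuming numpy is available; the names true_B, n, and b_max are mine, purely for illustration): we draw a sample from a uniform distribution with a known upper limit and evaluate a candidate statistic on it.

```python
import numpy as np

rng = np.random.default_rng(0)

true_B = 1000.0   # the "true" parameter, unknown in practice
n = 5             # number of captured serial numbers

# Draw a sample x_1, ..., x_n from Uniform(0, B)
x = rng.uniform(0.0, true_B, size=n)

# A statistic is just a function of the data; here, the maximum
def b_max(data):
    """Statistic b(x_1, ..., x_n) = max_i x_i."""
    return np.max(data)

print(b_max(x))  # one realization of the estimator B-hat
```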

Consistent estimators

Let me start with what will seem like a very inane choice of statistic. Maybe my favorite number is 3, so I will say that, regardless of what data I collect, I will just always guess that \(B\) is 3:

\[ b(x_1, \ldots, x_n) \equiv 3. \]

The corresponding estimator \(\hat{B}\) is very simple. It takes on the value 3 with 100% probability:

\[ f_{\hat{B}}(x) = \begin{cases} 1 &\text{if $x = 3$} \\ 0 &\text{otherwise} \end{cases} \]

This is clearly a bad estimator. No matter how much data I accumulate, my estimate doesn't improve. I will only be "right" if \(B\), by some miracle, happens to be exactly \(3\). By contrast, my expectation would be that, in the limit of collecting a lot of data, the statistic \(b\) would be almost guaranteed to be very close to \(B\). This requirement is called consistency.

Mathematically, I want the probability that \(\hat{B}\) takes on a value far from \(B\) to get close to zero as I collect more and more data. This requires defining what a "limit" is for a series of random variables. For a sequence of numbers, a limit means that, for any a priori, fixed threshold \(\varepsilon > 0\), there is some integer \(n\) such that every value in the sequence after the \(n\)-th one is within \(\varepsilon\) of the true value:

\[ \lim_{n\to\infty} a_n = L \implies \text{for any $\varepsilon > 0$ there exists some $n$ such that } |a_i - L| < \varepsilon \text{ for all $i \geq n$}. \]

For example, given the geometric series \(1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots\), if you pick the threshold \(\varepsilon = \tfrac{1}{100}\), I can pick \(n=7\), since the terms in the series are equal to \(\tfrac{1}{2^n}\), and \(\tfrac{1}{2^7} = \tfrac{1}{128}\). All subsequent terms in the series will be even closer to zero than that term.

In probability, if you give me \(\varepsilon\), in most cases I cannot pick an \(n\) such that the estimator maps all values outside that threshold to zero probability. Casually speaking, even after a trillion data points, there is no guarantee that a series of random data points won't stray from their limit. Instead, in probability, we say that a sequence of random variables \(X_n\) converges toward a number \(a\) if

\[ \lim_{n \to \infty} \mathbb{P}[|X_n - a| > \varepsilon] = 0, \]

that is, if the probability that \(X_n\) takes on a value more than \(\varepsilon\) away from \(a\) approaches zero as \(n\) increases. An estimator is consistent if the sequence of random variables corresponding to collecting more and more data converges toward the true value.

In our example, \(\hat{B}_1\) corresponds to the estimator when we only collect 1 data point, \(\hat{B}_2\) when we collect 2, and so on. In the dumb example where \(\hat{B}\) only takes on the value 3, the \(\hat{B}_n\) converge to 3, but not to \(B\), except in the unlikely case that \(B\) just happens to be 3.

For our second guess, let's use a more reasonable statistic. Say that \(b\) is the maximum of whatever data points we collected:

\[ b(x_1, \ldots, x_n) = \max_i x_i. \]

I know I can write out an analogous equation to define the estimator:

\[ \hat{B} = \max_i X_i. \]

But what does it mean to take the maximum of two random variables? Recall that we computed the distribution of a sum of two random variables \(Z = X + Y\) as \(f_Z(z) = \int f_X(x) f_Y(z-x) \,dx\). For a maximum, it is easier to express the new variable's distribution using its cdf. To say \(Z = \max[X, Y]\) means that, if \(X\) and \(Y\) are independent,

\[ \begin{aligned} F_Z(z) &= \mathbb{P}[Z \leq z] \\ &= \mathbb{P}[(X \leq z) \cap (Y \leq z)] \\ &= \mathbb{P}[X \leq z] \, \mathbb{P}[Y \leq z] \\ &= F_X(z) F_Y(z) \end{aligned} \]

So if \(F_X(x) = x/B\), then \(F_{\hat{B}_n}(x) = (x/B)^n\). Note that \(\hat{B}_n\) can only take on values between \(0\) and \(B\), so if it strays from \(B\), it can only do so by falling short:

\[ \begin{aligned} \mathbb{P}\left[ |\hat{B}_n - B| > \varepsilon \right] &= \mathbb{P}[\hat{B}_n \leq B - \varepsilon] \\ &= F_{\hat{B}_n}(B - \varepsilon) \\ &= \left(\frac{B - \varepsilon}{B}\right)^n \\ &= \left(1 - \frac{\varepsilon}{B}\right)^n. \end{aligned} \]

For \(\varepsilon > 0\), the limit of this number, as \(n \to \infty\), is zero, and thus \(\hat{B}_n\) converges to \(B\), so \(\hat{B}\) is a consistent estimator.
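If you prefer to check this numerically, here is a hedged sketch (numpy assumed; true_B, eps, and the sample sizes are arbitrary illustrative choices) that estimates \(\mathbb{P}[|\hat{B}_n - B| > \varepsilon]\) by simulation and compares it to \((1 - \varepsilon/B)^n\):

```python
import numpy as np

rng = np.random.default_rng(0)
true_B, eps, trials = 1000.0, 50.0, 100_000

for n in (1, 5, 20, 100):
    # Simulate many samples of size n and record the max of each
    x = rng.uniform(0.0, true_B, size=(trials, n))
    b_hat = x.max(axis=1)
    empirical = np.mean(np.abs(b_hat - true_B) > eps)
    analytical = (1 - eps / true_B) ** n
    print(n, round(empirical, 4), round(analytical, 4))
```

Both columns shrink toward zero as \(n\) grows, which is exactly what consistency demands.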

Unbiased estimators

It is a nice thing that our estimator \(\hat{B}\) is consistent: as we get more and more data, the statistic \(b\) will end up closer and closer to the true value \(B\). It is disappointing, though, that \(b\) is always an underestimate. If the data points \(x_i\) are randomly distributed between \(0\) and \(B\), and we take \(b = \max_i x_i\), it is necessarily the case that \(b < B\). Because \(b\) is not "centered around" \(B\), we say that \(\hat{B}\) is a biased estimator.

Consider the extreme case where \(n=1\): we draw only one data point \(x\), which is somewhere between \(0\) and \(B\). Intuitively, I might expect \(x\) to be about halfway between \(0\) and \(B\), that is, that \(x \approx \tfrac{1}{2}B\), so that \(b\) underestimates \(B\) by a factor of \(\tfrac{1}{2}\). And if \(n=2\), then I might expect those points, roughly speaking, to land somewhere like \(\tfrac{1}{3}B\) and \(\tfrac{2}{3}B\), so that \(b\) underestimates \(B\) by a factor of about \(\tfrac{1}{3}\). These factors appear to shrink as \(n\) increases, which makes sense, since \(\hat{B}\) is a consistent estimator. But can we do something to figure out, mathematically, what that factor should be?

We do this by examining the expected value of the estimator:

\[ \mathbb{E}[\hat{B}] = \int_0^B x \, f_{\hat{B}}(x) \, dx. \]

We said above that \(F_{\hat{B}}(x) = (x/B)^n\), from which it follows that

\[ f_{\hat{B}}(x) = \frac{d}{dx} F_{\hat{B}} = \frac{n}{B^n} x^{n-1}. \]

Plugging this definition of \(f_{\hat{B}}\) into the integral gives

\[ \mathbb{E}[\hat{B}] = \int_0^B \frac{n}{B^n} x^n \, dx = \frac{n}{n+1} B. \]

This result accords exactly with our intuition above: for \(n=1\), the expected value of \(\hat{B}\) is \(\tfrac{1}{2}B\). For \(n=2\), it is \(\tfrac{2}{3}B\), and so forth. Based on this finding, we can define a new estimator:

\[ \hat{B}_\mathrm{unbiased} \equiv \frac{n+1}{n} \max_i X_i. \]

This ensures that our estimates for \(B\) are "centered" around \(B\), because \(\mathbb{E}[\hat{B}_\mathrm{unbiased}] = B\). More generally, we say that an estimator is unbiased if its expected value is equal to the parameter it is estimating.
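As a quick numerical sanity check (a sketch only, with numpy assumed and illustrative values for \(B\) and \(n\)), we can simulate many samples and confirm that the raw maximum averages to about \(\tfrac{n}{n+1}B\) while the corrected estimator averages to about \(B\):

```python
import numpy as np

rng = np.random.default_rng(0)
true_B, n, trials = 1000.0, 5, 200_000

x = rng.uniform(0.0, true_B, size=(trials, n))
b_hat = x.max(axis=1)                      # biased estimator
b_hat_unbiased = (n + 1) / n * b_hat       # bias-corrected estimator

print(b_hat.mean())            # ~ n/(n+1) * B, i.e. about 833
print(b_hat_unbiased.mean())   # ~ B, i.e. about 1000
```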

Note the subtlety in what we did: without actually knowing what \(B\) is, we found a way to adjust the way that we estimate \(B\) to ensure that we come up with a more accurate estimate. I was mystified in introductory statistics that, while we computed the population variance as \(\tfrac{1}{n} \sum_i (x_i - \mu)^2\), we computed the sample variance as \(\tfrac{1}{n-1} \sum_i (x_i - \bar{x})^2\). The standard explanation has something to do with "degrees of freedom", which I never understood. There is, I think, a more straightforward explanation, which you can now understand: the \(n-1\) is there in the sample variance, instead of \(n\), because it corrects for a bias.
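The same kind of simulation shows that correction at work. This is a hedged sketch (numpy assumed; the normal distribution and the value of the true variance are arbitrary choices): dividing by \(n\) underestimates the variance on average, while dividing by \(n-1\) does not.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, trials = 4.0, 5, 200_000

# Data with a known variance; any distribution would do
x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
xbar = x.mean(axis=1, keepdims=True)

var_n = ((x - xbar) ** 2).sum(axis=1) / n                 # divide by n
var_n_minus_1 = ((x - xbar) ** 2).sum(axis=1) / (n - 1)   # divide by n - 1

print(var_n.mean())          # ~ (n-1)/n * true_var, biased low (about 3.2)
print(var_n_minus_1.mean())  # ~ true_var (about 4.0)
```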

Efficient estimators

Consistency ensures that, in the limit of infinite data, our estimate approaches the true value. How can we be sure that we are getting the best estimate for our data, given the fact that we can't collect infinite data? This question relates to the concept of efficiency. One unbiased estimator \(\hat{X}_1\) is more efficient than another unbiased estimator \(\hat{X}_2\) if it has lower variance:

\[ \mathbb{V}[\hat{X}_1] < \mathbb{V}[\hat{X}_2]. \]

As we will see, confidence intervals are related to the variance of an estimator, so a more efficient estimator means that we can get narrower confidence intervals for the same amount of data, which is clearly a desirable thing. I don't want to appear more ignorant than I have to be, just because I picked a poor estimator!
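To make the comparison concrete, here is a hedged simulation sketch (numpy assumed) for the German tank problem: twice the sample mean is also an unbiased estimator of \(B\) (since \(\mathbb{E}[X] = B/2\)), but it has a much larger variance than the bias-corrected maximum, so the maximum-based estimator is the more efficient of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
true_B, n, trials = 1000.0, 10, 200_000

x = rng.uniform(0.0, true_B, size=(trials, n))

est_mean = 2 * x.mean(axis=1)              # unbiased, because E[X] = B/2
est_max = (n + 1) / n * x.max(axis=1)      # unbiased, from the previous section

print(est_mean.mean(), est_max.mean())     # both ~ B
print(est_mean.var(), est_max.var())       # the max-based estimator's variance is far smaller
```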

It turns out that there is a theoretical lower limit to the variance of estimators called the Cramér-Rao bound, and many estimators actually hit this limit. Therefore, rather than saying that one estimator is more efficient than another, we usually just say that an estimator is "efficient", meaning that it hits the lower bound on variance and is maximally efficient.

The math behind efficiency is somewhat complex, so I will simply tell you that a whole class of estimators, maximum likelihood estimators, are efficient.

Maximum likelihood estimators

The German tank problem is a useful toy example, but it's hard to imagine deriving estimators by hand for the kind of complex data analysis that most scientists do. Fortunately, modern computing means that we can address all sorts of complex data using a maximum likelihood (ML) approach. In a maximum likelihood approach, you specify a theoretical model for how your data are generated, and then you use optimization to estimate values for the parameters that best "fit" the data.

The word likelihood is confusing because, in common speech, "probability" and "likelihood" are synonymous. In statistics jargon, the likelihood function is just the probability density function, except that it treats the parameters as the varying input and the data as fixed. Say we have some data \(x\) that was drawn from a probability distribution with density function \(f_X\). We often think about families of probability distributions and name them by their parameters. For example, we say that a normal distribution is defined by two parameters, its mean and its variance. So say that the density function \(f_X\) changes depending on some parameter (or parameters) \(\theta\). We write the density function as \(f_X(x; \theta)\) to emphasize that, while the parameters \(\theta\) are relevant, we think of them as fixed. By contrast, we write the likelihood function \(L\) as \(L(\theta; x) \equiv f_X(x; \theta)\). For likelihood, we consider the data fixed and vary the parameters.
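As a small illustration of this change in perspective, here is a sketch (numpy assumed; the data values and the hand-written normal density are purely illustrative) that holds one data set fixed and evaluates the likelihood at several candidate values of the mean:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density f_X(x; mu, sigma) of a normal distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([4.8, 5.1, 5.3])   # fixed data

# The likelihood L(mu; x) evaluates the same density, but varies the parameter
for mu in (4.0, 5.0, 6.0):
    likelihood = np.prod(normal_pdf(x, mu, sigma=1.0))
    print(mu, likelihood)        # largest near mu ~ 5, close to the sample mean
```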

To see how this plays out, let's return to the German tank problem. Given some data points \(x_i\), what is the probability density of drawing that data, for some varying parameter \(B\)? We look for the value of \(B\) that maximizes the probability density function for all the measured data points \(x_i\). This kind of optimization is called argmax. While max asks for the maximum value of some function \(f(x)\) when varying \(x\), argmax asks for the value of \(x\) that leads to \(f(x)\) being maximized. Thus:

\[ \begin{aligned} \hat{B}_\mathrm{ML} &= \argmax_B \prod_i f_X(x_i; B) \\ &= \argmax_B \prod_i \begin{cases} 1/B & 0 \leq x_i \leq B \\ 0 & x_i > B \end{cases} \\ &= \argmax_B \begin{cases} (1/B)^n & B \geq \max_i x_i \\ 0 & \text{otherwise} \end{cases} \\ &= \max_i x_i \end{aligned} \]

Note that, in the maximum likelihood approach, when computing \(f_X(x_i; B)\), we treat \(B\) as given, because it is the parameter we are optimizing over. This makes the maximum likelihood computation relatively simple: we cannot choose \(\hat{B}_\mathrm{ML}\) to be smaller than the maximum of the observed values \(x_i\), because that would make the observed data impossible, and we set \(\hat{B}_\mathrm{ML}\) at exactly \(\max_i x_i\) because the likelihood \((1/B)^n\) decreases as \(B\) grows, so choosing any greater value would make the observed data less likely. If \(B\) were actually much, much higher than \(\max_i x_i\), the observed data would be rare, being clustered very close to zero, relative to that very large \(B\).
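A quick numerical check of this argmax (a sketch only, with numpy assumed and made-up serial numbers) evaluates the likelihood on a grid of candidate values of \(B\) and confirms that it peaks at \(\max_i x_i\):

```python
import numpy as np

x = np.array([170.0, 410.0, 620.0])   # illustrative serial numbers
n = len(x)

B_grid = np.linspace(1.0, 2000.0, 4000)

# Likelihood: (1/B)^n when B >= max(x), and 0 otherwise
likelihood = np.where(B_grid >= x.max(), (1.0 / B_grid) ** n, 0.0)

B_ml = B_grid[np.argmax(likelihood)]
print(B_ml, x.max())   # B_ml lands at (essentially) max_i x_i
```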

Note also that the maximum likelihood approach for the German tank problem gave us a consistent estimator, but not an unbiased one. Under certain regularity conditions that probably hold for your applications, maximum likelihood estimators are consistent, but they are not, in general, unbiased.

Again, a key practical advantage of maximum likelihood estimation is that it requires only a computer and an articulation of how the data are generated.

The arithmetic mean as an estimator

The German tank problem has simple math, but it is not a familiar application. Let us consider instead a very familiar estimator, the arithmetic mean, that is, the simple average of numbers. Just as the mean of some set of numbers \(x_i\) is \((1/n) \sum_i x_i\), we analogously define the estimator \(\hat{\mathbb{E}}_X \equiv (1/n) \sum_i X_i\) of the mean of i.i.d. random variables \(X_i\).

It is immediately clear that \(\hat{\mathbb{E}}_X\) is an unbiased estimator of \(\mathbb{E}[X]\):

\[ \mathbb{E}[\hat{\mathbb{E}}_X] = \mathbb{E}\left[\frac{1}{n} \sum_i X_i \right] = \frac{1}{n} \sum_i \mathbb{E}[X_i] = \mathbb{E}[X]. \]

But is \(\hat{\mathbb{E}}_X\) a consistent estimator of \(\mathbb{E}[X]\)? The mathematician Jacob Bernoulli published the first proof of this fact for a special case of random variables (i.e., those following the Bernoulli distribution) in 1713. It took him 20 years to develop a rigorous proof for what he proudly called the "Golden Theorem". Now the proof is considerably simpler and broader, and we call it the law of large numbers.

The modern proof of the law of large numbers follows from an observation about the variance of the estimator \(\hat{\mathbb{E}}_X\) that was surprising to many early statisticians:

\[ \mathbb{V}\left[\hat{\mathbb{E}}_X \right] = \mathbb{V}\left[\frac{1}{n} \sum_i X_i \right] = \frac{1}{n^2} \sum_i \mathbb{V}[X_i] = \frac{1}{n} \mathbb{V}[X] \]

In words, the variance of the estimator is \(1/n\) of the variance of the underlying random variable it is estimating. Intuitively, this means that, as you take more data, you are ever more likely to get an estimate near to the true mean of the underlying population you are sampling. This result makes it clear that \(\hat{\mathbb{E}}_X\) is a consistent estimator, because for arbitrarily high \(n\), it has arbitrarily low variance.

Because the standard deviation, the square root of the variance, is more intuitive, one can say that the standard deviation of the estimator decreases with the square root of \(n\). To distinguish it from the standard deviation of the underlying population, the standard deviation of the estimator, \(\sqrt{\mathbb{V}[\hat{\mathbb{E}}_X]}\), is also called the standard error of the mean. With more data points \(n\), you can get an ever more precise estimate of the mean, so the standard error of the mean declines, but the standard deviation of the underlying population, \(\sqrt{\mathbb{V}[X]}\), is fixed.
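Here is a hedged simulation sketch (numpy assumed; the population standard deviation and sample sizes are arbitrary) showing that the standard deviation of the sample mean tracks \(\sqrt{\mathbb{V}[X]}/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
pop_sd, trials = 2.0, 200_000

for n in (10, 100, 1000):
    x = rng.normal(loc=0.0, scale=pop_sd, size=(trials, n))
    sample_means = x.mean(axis=1)
    # Empirical standard error vs. the pop_sd / sqrt(n) prediction
    print(n, sample_means.std(), pop_sd / np.sqrt(n))
```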

This result was surprising because it was expected that the standard error would scale like \(1/n\) rather than \(1/\sqrt{n}\). In other words, it was expected that, if you collected double the data, you would get double the precision. In fact, when you double the data, you get only an improvement of \(\sqrt{2}\), or about 41%, in standard error. More starkly, if you take one hundred times the data, you get only a ten-fold increase in precision.