## Parameter Estimation - Part 2 (German Tank Problem)

Preliminaries: Parameter Estimation - Part 1 (Reader Request)

Part 1 of this post introduced the concept of calculating statistics based on sample data and using them to estimate the unknown values of population parameters. As an estimator's predictive quality depends on both its bias and its variance, unbiasedness and low variance are clearly desirable qualities. In this post, we will derive and compare various estimators in the context of a concrete example, the German Tank Problem.

### Historical background

During WWII, the Economic Warfare Division of the US Embassy in London utilized a statistical method to estimate the monthly production rate of German tanks (among other equipment). While standard intelligence estimates (based, for example, on widely divergent POW reports or projections based on pre-war production capacity) amounted to 1,400 tanks per month between June 1940 and September 1942, the statistical analysis yielded a far lower figure: 256 per month. After the war, captured government records revealed the actual figure to be 255.

This statistical analysis was based on serial number markings on parts of captured or destroyed tanks. For example, the serial numbers on various wheels yielded an estimate of the total number of wheel molds in use, and British road-wheel manufacturers provided data on the number of wheels that could be produced from a single mold. Thus, if there are 32 wheels per tank, the estimated number of molds in use, times the number of wheels produced per mold per month, divided by 32, would be an estimate for the number of tanks produced in a given month.

Analysis of the 64 wheels from two captured tanks produced an estimate of 270 tanks produced in February 1944, while government records indicated 276 tanks for that month. The statistical method was thus quite effective, even based on a relatively small number of captured tanks. The Allied statisticians also analyzed wheels, gearboxes, chassis, and engines in order to cross-check their results.

Ignoring the particulars of wheel molds and serial numbering conventions, the mathematical statement of the estimation problem now known as the German Tank Problem is quite simple: suppose we have a population of $N$ items with labels $1,2, \dotsc, N$. Given a random sample of $k$ items with labels $x_1, x_2, \dotsc, x_k$, how do we estimate the population size $N$?

We would like an estimator for $N$ which is unbiased and has low variance. Without loss of generality, we can assume that the labels are in order, i.e. $x_1 < x_2 < \dotsb < x_k$. Furthermore, we define the sample maximum $m := x_k$.

### Sample maximum: a crude estimator

One obvious candidate for an estimator of $N$ is simply $m$. However, common sense dictates that this estimate must be biased downward, as $m$ cannot exceed $N$. Thus, the expected value of $m$ should be less than or equal to $N$.

Indeed, the probability of obtaining a sample with maximum $m$ is the number of ways to choose a sample with maximum $m$ divided by the number of possible samples. To choose a sample with maximum $m$, we must choose $m$ (there is only one way to do this) as well as $k-1$ additional sample values from the set $\lbrace 1, 2, \dotsc, m-1 \rbrace$ in one of ${m-1} \choose {k-1}$ ways, so the probability is $${\Bbb P}\left[ x_k = m \right] = \frac{{m-1}\choose{k-1}}{{N}\choose{k}}$$ and the expected value of $m$ is: \begin{align} {\Bbb E}[m] &= \sum_{j=k}^{N}{j \frac{{j-1}\choose{k-1}}{{N}\choose{k}}} \\[2mm] &= \frac{1}{N \choose k} \sum_{j=k}^{N}{j \frac{(j-1)!}{((j-1)-(k-1))! (k-1)!}} \\[2mm] &= \frac{1}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{\frac{j!}{(j-k)!}} \\[2mm] &= \frac{k!}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{j \choose k} \\[2mm] &\buildrel{(*)} \over{=} k \frac{{N+1}\choose{k+1}}{N \choose k} \\[2mm] &= \frac{k}{k+1}(N+1) \end{align} which does not equal $N$, except when $k=N$. Thus, $m$ is a biased estimator of $N$. The bias does become small when the sample size is a significant proportion of $N$, but this is not usually the case in practice.

$(*)$ Note: this equality follows from the Hockey-stick Identity for binomial coefficients: $\sum_{j=k}^{N}{j \choose k} = {{N+1} \choose {k+1}}$

 Expected value of $m$ for various values of sample size $k$ (with $N=250$). Notice that the bias is sizable until $k$ is ~20-30, i.e. 8-12% of $N$.

### A few unbiased estimators

We can employ a few common sense observations to derive unbiased estimators of $N$.

Firstly, let $\psi$ be the middle value of the list $1, 2, \dotsc, N$, i.e. $\psi = \frac{1}{2}(N+1)$ (this formula works whether $N$ is odd or even). Then $N=2\psi-1$, and the sample mean $\overline{x} = \frac{1}{k}\sum_{i=1}^{k}{x_i}$ is an unbiased estimator of $\psi$ (you can prove this using induction, for example). This, in turn, implies that $N_1 := 2\overline{x}-1$ is an unbiased estimator of $N$.

The issue with $N_1$ is that it can be less than the sample maximum $m$. For example, if we have a sample of size $k=3$ with $x_1 = 2, x_2 = 3, x_3=10$, then $\overline{x}=5$, and $N_1=9$, but $m=10$. This is an undesirable estimator as we know for sure that $N \geq 10$.

A more sophisticated method for deriving estimates revolves around analysis of the expected "gaps" between observations in the sample. For example, we may expect that, on average, the sample observations $x_1, x_2, \dotsc, x_k$ will be distributed roughly evenly throughout the interval $1, 2, \dotsc, N$. Therefore, we expect the number of unobserved labels below $x_1$ (which is observable from our sample, as the lowest label is $1$) to be equal, on average, to the number of unobserved labels above $m=x_k$ (which is not observable, as $N$ is unknown). Thus, we define our second estimator to be $N_2 := m + (x_1 - 1)$.

We can further refine $N_2$ by replacing the "gap" $(x_1 - 1)$ with the average of all the gaps in our sample, thus more fully utilizing the sample data. The average gap size in the sample is \begin{align} G_{\text{avg}} &= \frac{1}{k} \left[ (x_1 - 1) + (x_2 - x_1 - 1) + \dotsb + (x_k - x_{k-1} - 1) \right] \\[2mm] &= \frac{1}{k} \left[ x_k - k \right] \\[2mm] &= \frac{m}{k}-1 \end{align} Consequently, we define our third estimator to be $N_3 := m + G_{\text{avg}} = m + \left( \frac{m}{k} - 1\right)$. Clearly, $N_2, N_3 \geq m$, so these estimators cannot exhibit the undesirable quality of taking values less than $m$.

Furthermore, \begin{align} {\Bbb E}[N_3] &= {\Bbb E}\left[ m + \left( \frac{m}{k} - 1 \right) \right] \\[2mm] &= {\Bbb E}\left[ m \left( 1+k^{-1} \right) - 1 \right] \\[2mm] &= \left( 1+k^{-1} \right) {\Bbb E}\left[ m \right] - 1 \\[2mm] &= \left( 1+k^{-1} \right) \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm] &= \frac{k+1}{k} \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm] &= N \end{align} so $N_3$ is an unbiased estimator. The proof that $N_2$ is also unbiased follows a similar line of reasoning.

### Variance comparison

Below is a comparison of the variances of the three unbiased estimators $N_1, N_2, N_3$ defined above. As I intend this post to be more practical than technical, I will avoid a lengthy computation involving binomial coefficients and simply state the variances below without proof.

Note: the Wikipedia article on the Discrete uniform distribution contains the full derivation of the variance formula for $N_3$; this derivation uses the formulas ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2} {\rm Var}[m]$ and ${\rm Var}[m] = {\Bbb E}[m^2] - ({\Bbb E}[m])^2$. The computation of ${\Bbb E}[m^2]$ is, of course, the lengthy part but is ultimately similar to our derivation of ${\Bbb E}[m]$ above.

The variance formulas are:

• ${\rm Var}[N_1] = \frac{(k+2)}{3k} \frac{(N-k)(N+1)}{(k+2)}$
• ${\rm Var}[N_2] = \frac{2}{(k+1)} \frac{(N-k)(N+1)}{(k+2)}$
• ${\rm Var}[N_3] = \frac{1}{k} \frac{(N-k)(N+1)}{(k+2)}$
while the standard deviations, which have more recognizable scales (i.e. number of tanks instead of squared number of tanks), are the square roots of respective variances. Let's see how the estimators perform for various sample sizes:

We can see that $N_3$ has the lowest variance of the three for all sample sizes. In the legend, I have marked $N_3$ as "MVUE", which stands for minimum variance unbiased estimator. This means that $N_3$ actually has the lowest variance of any unbiased estimator for $N$, not just the three above.

In Part 3 of this post, I will introduce the mathematical heavy machinery (sorry for the pun) to prove that $N_3$ is indeed the MVUE.

### Sources

Wikipedia: German Tank Problem
Wikipedia: Discrete uniform distribution
Estimating the Size of a Population (Roger Johnson) - a paper introducing the unbiased estimators above.
An Empirical Approach to Economic Intelligence in World War II (Richard Ruggles and Henry Brodie, 1947) - the original paper describing the use of these statistical methods during WWII.

## Parameter Estimation - Part 1 (Reader Request)

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 2, Proof of the Law of Large Numbers

The preliminary post How do mathematicians model randomness? introduced random variables, their probability distributions, and parameters thereof (namely, mean, variance, and standard deviation). This post, the response to a reader request from Anonymous, will cover estimation of parameters based on random sampling. I will explain the difference between parameters and statistics, introduce the concept of estimator bias, and address the reader request's specific question about unbiased estimation of standard deviation.

In Part 2 of this post, I will present a well known historical application of parameter estimation, the German Tank Problem, and compare methods of estimating an unknown population size. Finally, in Part 3, I will introduce complete and sufficient statistics, which allow us to prove that the best estimator among the candidates in Part 2 is the unique minimum variance unbiased estimator.

### Parameters and statistics

Recall from the first preliminary post that we model random phenomena with random variables and their probability distributions. Key characteristics of these distributions (mean, variance, standard deviation, etc.) are called parameters.

Parameters are defined based on all possible values of a random variable (the population) weighted by their relative frequencies of occurrence (i.e. their probabilities). For example, the mean (denoted $\mu_{X}$, ${\Bbb E}(X)$, or simply $\mu$ when there is no ambiguity as to the underlying random variable) of an outcome of interest (i.e. random variable) $X$ is defined as $$\mu = \sum_{i}{x_i {\Bbb P}(x_i)}$$ where each $x_i$ is one of the possible values of $X$ and the sum runs over all possible values (in the continuous case, the sum would be replaced by an integral).

In practice, we do not have access to all possible values of $X$ and their probabilities. Instead, we typically have a sample of observed values, $x_1, x_2, \dotsc, x_n$. Given such a sample, we could estimate $\mu$ using the sample mean $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$$ The sample mean is a statistic, a value calculated based on sample data, which can be used to estimate the (unknown) parameter value. Notice that $\bar{x}$ can take on different values depending on which random sample we used to calculate it. In other words, $\bar{x}$ is itself a random variable.

### Estimator bias, Bessel's correction

As random variables, statistics have their own probability distributions, known as sampling distributions, and thus their own means, variances, standard deviations, etc. We actually already touched upon this fact in the earlier posts Monte Carlo Simulation - Part 2 and Proof of the Law of Large Numbers, in which we proved that the sample mean $\bar{x}$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$ (a fact that we will use below).

Since ${\Bbb E}(\bar{x}) = \mu$, we say that the sample mean is an unbiased estimator of the population mean $\mu$. By the same logic, we may estimate the population variance $\sigma^2 = \sum_{i}{(x_i-\mu)^{2}{\Bbb P}(x_i)}$ using the statistic $$s^2_1= \frac{1}{n} \sum_{i=1}^{n}{(x_i-\bar{x})^2}$$ However, this statistic is not an unbiased estimator of $\sigma^2$. In order to see why this is the case, we can compute the expected value ${\Bbb E}(\sigma^2 - s_1^2)$, which would be zero if $s_1^2$ were unbiased: \begin{align} {\Bbb E} \left[ \sigma^2 - s_1^2 \right] &= {\Bbb E} \left[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2} - \frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2}\right] \\[2mm] &= \frac{1}{n}{\Bbb E}\left[ \sum_{i=1}^{n}{\left( \left( x_i^2 - 2 x_i \mu + \mu^2) - (x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) \right)} \right] \\[2mm] &= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + \frac{1}{n}\sum_{i=1}^{n}{(2x_i (\bar{x}-\mu))} \right] \\[2mm] &= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + 2\bar{x}(\bar{x}-\mu) \right] \\[2mm] &= {\Bbb E}\left[ \mu^2 - 2\bar{x}\mu + \bar{x}^2 \right] \\[2mm] &= {\Bbb E} \left[ (\bar{x} - \mu)^2 \right] \\[2mm] &= \rm{Var}(\bar{x}) \\[2mm] &= \frac{\sigma^2}{n} \end{align} Since ${\Bbb E}[\sigma^2 - s_1^2] = {\Bbb E}[\sigma^2] - {\Bbb E}[s_1^2] = \sigma^2 - {\Bbb E}[s_1^2]$, the above implies that $${\Bbb E}[s_1^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$ Therefore, the statistic $s^2 = \frac{n}{n-1}s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2}$, known as the sample variance, has expected value $\sigma^2$ and is thus an unbiased estimator of the population variance.

The replacement of $\frac{1}{n}$ with $\frac{1}{n-1}$ in the sample variance formula is known as Bessel's correction. The derivation above shows that the bias in $s_1^2$ arises due to the fact that $(x_i-\bar{x})$ underestimates the actual quantity of interest, $(x_i-\mu)$, by $(\bar{x}-\mu)$ for each $x_i$. Therefore, the bias is the variance of $\bar{x}$, which we proved to be $\frac{\sigma^2}{n}$ in Proof of the Law of Large Numbers. Using $s^2$ instead of $s_1^2$ corrects for this bias.

### Estimation of the standard deviation

Given $s^2$ is an unbiased estimator of $\sigma^2$, we may expect that the sample standard deviation $s=\sqrt{s^2}$ would also be an unbiased estimator of the population standard deviation $\sigma$. However, \begin{align} {\Bbb E}\left[ s^2 \right] &= \sigma^2 \\[2mm] \Rightarrow {\Bbb E}\left[ \sqrt{s^2} \right] &< \sqrt{{\Bbb E}\left[ s^2 \right]} = \sqrt{\sigma^2} = \sigma \end{align} where the inequality follows from Jensen's inequality and the fact that the square root is a concave function (since the area above it is concave, not convex). In other words, $s$ underestimates $\sigma$ on average.

Unfortunately, for estimating the population standard deviation, there is no easy correction as there is for the variance. The size of the necessary correction depends on the distribution of the underlying random variable. For the Normal distribution, there is a complicated exact formula, but simply replacing the $n-1$ in the denominator with $n-1.5$ eliminates most of the bias (with the remaining bias decreasing with increasing sample size). A further adjustment is possible for other distributions and depends on the excess kurtosis, a measure of the "heavy-tailedness" of the distribution in excess of that of the Normal distribution.

While the specific corrections are beyond the scope of this post, for the brave, there is an entire Wikipedia article dedicated to exactly this topic.

### Other measures of estimator quality

Zero bias is certainly a desirable quality for a statistic to have, but an estimator's quality depends on more than just its expected value. A statistic's variance tells us how large of a spread (from its expected value) we may expect when calculating the statistic based on various samples. Just as a statistic with large bias is not particularly helpful, neither is one with no bias but a large variance.

The notion of consistency ties bias and variance together nicely: a consistent estimator is one which converges in probability to the population parameter. This means that, as $n \rightarrow \infty$, the probability of an error greater than some specified amount $\epsilon$ approaches zero. This further implies that both the bias and the variance tend to zero as the sample size grows.

For example, the (weak) law of large numbers implies that $\bar{x}$ is a consistent estimator of $\mu$, as $\lim_{n \rightarrow \infty}{{\Bbb P} \left[ \left| \bar{x} - \mu \right| \geq \epsilon \right]} = 0$ for any $\epsilon > 0$. Furthermore, $s_1^2$ and $s^2$ are both consistent estimators of $\sigma^2$, while $s$ is a consistent estimator of $\sigma$. These examples show that both biased and unbiased estimators can be consistent.

That will do it for Part 1 of this post. Thanks for reading, and look out for Parts 2 and 3, coming up soon. Thanks to Anonymous for the great reader request.

### Sources:

Wikipedia- unbiased estimation of standard deviation
Wikipedia - Bessel's correction
Quora post- estimator bias vs. variance