## Parameter Estimation - Part 2 (German Tank Problem)

Preliminaries: Parameter Estimation - Part 1 (Reader Request)

Part 1 of this post introduced the concept of calculating statistics based on sample data and using them to estimate the unknown values of population parameters. As an estimator's predictive quality depends on both its bias and its variance, unbiasedness and low variance are clearly desirable qualities. In this post, we will derive and compare various estimators in the context of a concrete example, the German Tank Problem.

### Historical background

During WWII, the Economic Warfare Division of the US Embassy in London utilized a statistical method to estimate the monthly production rate of German tanks (among other equipment). While standard intelligence estimates (based, for example, on widely divergent POW reports or projections based on pre-war production capacity) amounted to 1,400 tanks per month between June 1940 and September 1942, the statistical analysis yielded a far lower figure: 256 per month. After the war, captured government records revealed the actual figure to be 255.

This statistical analysis was based on serial number markings on parts of captured or destroyed tanks. For example, the serial numbers on various wheels yielded an estimate of the total number of wheel molds in use, and British road-wheel manufacturers provided data on the number of wheels that could be produced from a single mold. Thus, if there are 32 wheels per tank, the estimated number of molds in use, times the number of wheels produced per mold per month, divided by 32, would be an estimate for the number of tanks produced in a given month.

Analysis of the 64 wheels from two captured tanks produced an estimate of 270 tanks produced in February 1944, while government records indicated 276 tanks for that month. The statistical method was thus quite effective, even based on a relatively small number of captured tanks. The Allied statisticians also analyzed wheels, gearboxes, chassis, and engines in order to cross-check their results.

Ignoring the particulars of wheel molds and serial numbering conventions, the mathematical statement of the estimation problem now known as the German Tank Problem is quite simple: suppose we have a population of $N$ items with labels $1,2, \dotsc, N$. Given a random sample of $k$ items with labels $x_1, x_2, \dotsc, x_k$, how do we estimate the population size $N$?

We would like an estimator for $N$ which is unbiased and has low variance. Without loss of generality, we can assume that the labels are in order, i.e. $x_1 < x_2 < \dotsb < x_k$. Furthermore, we define the sample maximum $m := x_k$.

### Sample maximum: a crude estimator

One obvious candidate for an estimator of $N$ is simply $m$. However, common sense dictates that this estimate must be biased downward, as $m$ cannot exceed $N$. Thus, the expected value of $m$ should be less than or equal to $N$.

Indeed, the probability of obtaining a sample with maximum $m$ is the number of ways to choose a sample with maximum $m$ divided by the number of possible samples. To choose a sample with maximum $m$, we must choose $m$ (there is only one way to do this) as well as $k-1$ additional sample values from the set $\lbrace 1, 2, \dotsc, m-1 \rbrace$ in one of ${m-1} \choose {k-1}$ ways, so the probability is $${\Bbb P}\left[ x_k = m \right] = \frac{{m-1}\choose{k-1}}{{N}\choose{k}}$$ and the expected value of $m$ is: \begin{align} {\Bbb E}[m] &= \sum_{j=k}^{N}{j \frac{{j-1}\choose{k-1}}{{N}\choose{k}}} \\[2mm] &= \frac{1}{N \choose k} \sum_{j=k}^{N}{j \frac{(j-1)!}{((j-1)-(k-1))! (k-1)!}} \\[2mm] &= \frac{1}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{\frac{j!}{(j-k)!}} \\[2mm] &= \frac{k!}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{j \choose k} \\[2mm] &\buildrel{(*)} \over{=} k \frac{{N+1}\choose{k+1}}{N \choose k} \\[2mm] &= \frac{k}{k+1}(N+1) \end{align} which does not equal $N$, except when $k=N$. Thus, $m$ is a biased estimator of $N$. The bias does become small when the sample size is a significant proportion of $N$, but this is not usually the case in practice.

$(*)$ Note: this equality follows from the Hockey-stick Identity for binomial coefficients: $\sum_{j=k}^{N}{j \choose k} = {{N+1} \choose {k+1}}$

 Expected value of $m$ for various values of sample size $k$ (with $N=250$). Notice that the bias is sizable until $k$ is ~20-30, i.e. 8-12% of $N$.

### A few unbiased estimators

We can employ a few common sense observations to derive unbiased estimators of $N$.

Firstly, let $\psi$ be the middle value of the list $1, 2, \dotsc, N$, i.e. $\psi = \frac{1}{2}(N+1)$ (this formula works whether $N$ is odd or even). Then $N=2\psi-1$, and the sample mean $\overline{x} = \frac{1}{k}\sum_{i=1}^{k}{x_i}$ is an unbiased estimator of $\psi$ (you can prove this using induction, for example). This, in turn, implies that $N_1 := 2\overline{x}-1$ is an unbiased estimator of $N$.

The issue with $N_1$ is that it can be less than the sample maximum $m$. For example, if we have a sample of size $k=3$ with $x_1 = 2, x_2 = 3, x_3=10$, then $\overline{x}=5$, and $N_1=9$, but $m=10$. This is an undesirable estimator as we know for sure that $N \geq 10$.

A more sophisticated method for deriving estimates revolves around analysis of the expected "gaps" between observations in the sample. For example, we may expect that, on average, the sample observations $x_1, x_2, \dotsc, x_k$ will be distributed roughly evenly throughout the interval $1, 2, \dotsc, N$. Therefore, we expect the number of unobserved labels below $x_1$ (which is observable from our sample, as the lowest label is $1$) to be equal, on average, to the number of unobserved labels above $m=x_k$ (which is not observable, as $N$ is unknown). Thus, we define our second estimator to be $N_2 := m + (x_1 - 1)$.

We can further refine $N_2$ by replacing the "gap" $(x_1 - 1)$ with the average of all the gaps in our sample, thus more fully utilizing the sample data. The average gap size in the sample is \begin{align} G_{\text{avg}} &= \frac{1}{k} \left[ (x_1 - 1) + (x_2 - x_1 - 1) + \dotsb + (x_k - x_{k-1} - 1) \right] \\[2mm] &= \frac{1}{k} \left[ x_k - k \right] \\[2mm] &= \frac{m}{k}-1 \end{align} Consequently, we define our third estimator to be $N_3 := m + G_{\text{avg}} = m + \left( \frac{m}{k} - 1\right)$. Clearly, $N_2, N_3 \geq m$, so these estimators cannot exhibit the undesirable quality of taking values less than $m$.

Furthermore, \begin{align} {\Bbb E}[N_3] &= {\Bbb E}\left[ m + \left( \frac{m}{k} - 1 \right) \right] \\[2mm] &= {\Bbb E}\left[ m \left( 1+k^{-1} \right) - 1 \right] \\[2mm] &= \left( 1+k^{-1} \right) {\Bbb E}\left[ m \right] - 1 \\[2mm] &= \left( 1+k^{-1} \right) \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm] &= \frac{k+1}{k} \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm] &= N \end{align} so $N_3$ is an unbiased estimator. The proof that $N_2$ is also unbiased follows a similar line of reasoning.

### Variance comparison

Below is a comparison of the variances of the three unbiased estimators $N_1, N_2, N_3$ defined above. As I intend this post to be more practical than technical, I will avoid a lengthy computation involving binomial coefficients and simply state the variances below without proof.

Note: the Wikipedia article on the Discrete uniform distribution contains the full derivation of the variance formula for $N_3$; this derivation uses the formulas ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2} {\rm Var}[m]$ and ${\rm Var}[m] = {\Bbb E}[m^2] - ({\Bbb E}[m])^2$. The computation of ${\Bbb E}[m^2]$ is, of course, the lengthy part but is ultimately similar to our derivation of ${\Bbb E}[m]$ above.

The variance formulas are:

• ${\rm Var}[N_1] = \frac{(k+2)}{3k} \frac{(N-k)(N+1)}{(k+2)}$
• ${\rm Var}[N_2] = \frac{2}{(k+1)} \frac{(N-k)(N+1)}{(k+2)}$
• ${\rm Var}[N_3] = \frac{1}{k} \frac{(N-k)(N+1)}{(k+2)}$
while the standard deviations, which have more recognizable scales (i.e. number of tanks instead of squared number of tanks), are the square roots of respective variances. Let's see how the estimators perform for various sample sizes:

We can see that $N_3$ has the lowest variance of the three for all sample sizes. In the legend, I have marked $N_3$ as "MVUE", which stands for minimum variance unbiased estimator. This means that $N_3$ actually has the lowest variance of any unbiased estimator for $N$, not just the three above.

In Part 3 of this post, I will introduce the mathematical heavy machinery (sorry for the pun) to prove that $N_3$ is indeed the MVUE.