Parameter Estimation - Part 2 (German Tank Problem)
Preliminaries: Parameter Estimation - Part 1 (Reader Request)
Part 1 of this post introduced the concept of calculating statistics based on sample data and using them to estimate the unknown values of population parameters. As an estimator's predictive quality depends on both its bias and its variance, unbiasedness and low variance are clearly desirable qualities. In this post, we will derive and compare various estimators in the context of a concrete example, the German Tank Problem.
Historical background
During WWII, the Economic Warfare Division of the US Embassy in London utilized a statistical method to estimate the monthly production rate of German tanks (among other equipment). While standard intelligence estimates (based, for example, on widely divergent POW reports or projections based on pre-war production capacity) amounted to 1,400 tanks per month between June 1940 and September 1942, the statistical analysis yielded a far lower figure: 256 per month. After the war, captured government records revealed the actual figure to be 255.
This statistical analysis was based on serial number markings on parts of captured or destroyed tanks. For example, the serial numbers on various wheels yielded an estimate of the total number of wheel molds in use, and British road-wheel manufacturers provided data on the number of wheels that could be produced from a single mold. Thus, if there are 32 wheels per tank, the estimated number of molds in use, times the number of wheels produced per mold per month, divided by 32, would be an estimate for the number of tanks produced in a given month.
Analysis of the 64 wheels from two captured tanks produced an estimate of 270 tanks produced in February 1944, while government records indicated 276 tanks for that month. The statistical method was thus quite effective, even based on a relatively small number of captured tanks. The Allied statisticians also analyzed wheels, gearboxes, chassis, and engines in order to cross-check their results.
Ignoring the particulars of wheel molds and serial numbering conventions, the mathematical statement of the estimation problem now known as the German Tank Problem is quite simple: suppose we have a population of $N$ items with labels $1, 2, \dotsc, N$. Given a random sample of $k$ items, drawn without replacement, with labels $x_1, x_2, \dotsc, x_k$, how do we estimate the population size $N$?
We would like an estimator for $N$ which is unbiased and has low variance. Without loss of generality, we can assume that the labels are in order, i.e. $x_1 < x_2 < \dotsb < x_k$. Furthermore, we define the sample maximum $m := x_k$.
Sample maximum: a crude estimator
One obvious candidate for an estimator of $N$ is simply $m$. However, common sense dictates that this estimate must be biased downward: $m$ can never exceed $N$, and it falls strictly below $N$ with positive probability whenever $k < N$, so in that case the expected value of $m$ must be less than $N$.
Indeed, the probability of obtaining a sample with maximum $m$ is the number of ways to choose a sample with maximum $m$ divided by the number of possible samples. To choose a sample with maximum $m$, we must choose $m$ (there is only one way to do this) as well as $k-1$ additional sample values from the set $\lbrace 1, 2, \dotsc, m-1 \rbrace$ in one of ${m-1} \choose {k-1}$ ways, so the probability is $$
{\Bbb P}\left[ x_k = m \right] = \frac{{m-1}\choose{k-1}}{{N}\choose{k}}
$$ and the expected value of $m$ is: $$
\begin{align}
{\Bbb E}[m]
&= \sum_{j=k}^{N}{j \frac{{j-1}\choose{k-1}}{{N}\choose{k}}} \\[2mm]
&= \frac{1}{N \choose k} \sum_{j=k}^{N}{j \frac{(j-1)!}{((j-1)-(k-1))! (k-1)!}} \\[2mm]
&= \frac{1}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{\frac{j!}{(j-k)!}} \\[2mm]
&= \frac{k!}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{j \choose k} \\[2mm]
&\buildrel{(*)} \over{=} k \frac{{N+1}\choose{k+1}}{N \choose k} \\[2mm]
&= \frac{k}{k+1}(N+1)
\end{align}
$$ which does not equal $N$, except when $k=N$. Thus, $m$ is a biased estimator of $N$. The bias does become small when the sample size is a significant proportion of $N$, but this is not usually the case in practice.
$(*)$ Note: this equality follows from the Hockey-stick Identity for binomial coefficients: $\sum_{j=k}^{N}{j \choose k} = {{N+1} \choose {k+1}}$
[Figure] Expected value of $m$ for various values of the sample size $k$ (with $N=250$). Notice that the bias is sizable until $k$ is roughly 20-30, i.e. 8-12% of $N$.
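To see this bias numerically, here is a minimal Python sketch (the function name and trial count are illustrative choices, not part of the original analysis) that repeatedly draws samples without replacement and averages the observed maxima:

```python
import random

def mean_sample_max(N, k, trials=100_000):
    """Monte Carlo estimate of E[m], the expected sample maximum."""
    total = 0
    for _ in range(trials):
        total += max(random.sample(range(1, N + 1), k))  # k labels drawn without replacement
    return total / trials

print(mean_sample_max(N=250, k=10))  # close to k/(k+1) * (N+1) = 228.2, noticeably below N = 250
```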
A few unbiased estimators
We can employ a few common sense observations to derive unbiased estimators of $N$.
Firstly, let $\psi$ be the middle value of the list $1, 2, \dotsc, N$, i.e. $\psi = \frac{1}{2}(N+1)$ (this formula works whether $N$ is odd or even). Then $N=2\psi-1$, and the sample mean $\overline{x} = \frac{1}{k}\sum_{i=1}^{k}{x_i}$ is an unbiased estimator of $\psi$ (each label $j$ appears in the sample with probability $\frac{k}{N}$, so ${\Bbb E}\left[\sum_{i=1}^{k}{x_i}\right] = \frac{k}{N}\sum_{j=1}^{N}{j} = \frac{k(N+1)}{2}$, and hence ${\Bbb E}[\overline{x}] = \psi$). This, in turn, implies that $N_1 := 2\overline{x}-1$ is an unbiased estimator of $N$.
The issue with $N_1$ is that it can be less than the sample maximum $m$. For example, if we have a sample of size $k=3$ with $x_1 = 2, x_2 = 3, x_3=10$, then $\overline{x}=5$, and $N_1=9$, but $m=10$. This is an undesirable estimator as we know for sure that $N \geq 10$.
A more sophisticated method for deriving estimates revolves around analysis of the expected "gaps" between observations in the sample. For example, we may expect that, on average, the sample observations $x_1, x_2, \dotsc, x_k$ will be distributed roughly evenly throughout the interval $1, 2, \dotsc, N$. Therefore, we expect the number of unobserved labels below $x_1$ (which is observable from our sample, as the lowest label is $1$) to be equal, on average, to the number of unobserved labels above $m=x_k$ (which is not observable, as $N$ is unknown). Thus, we define our second estimator to be $N_2 := m + (x_1 - 1)$.
We can further refine $N_2$ by replacing the "gap" $(x_1 - 1)$ with the average of all the gaps in our sample, thus more fully utilizing the sample data. The average gap size in the sample is $$
\begin{align}
G_{\text{avg}}
&= \frac{1}{k} \left[ (x_1 - 1) + (x_2 - x_1 - 1)
+ \dotsb + (x_k - x_{k-1} - 1) \right] \\[2mm]
&= \frac{1}{k} \left[ x_k - k \right] \\[2mm]
&= \frac{m}{k}-1
\end{align}
$$ Consequently, we define our third estimator to be $N_3 := m + G_{\text{avg}} = m + \left( \frac{m}{k} - 1\right)$. Clearly, $N_2, N_3 \geq m$, so these estimators cannot exhibit the undesirable quality of taking values less than $m$.
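In code, the three estimators are one-liners. The following minimal Python sketch (the helper name `estimators` is mine, not from the original analysis) computes them for the example sample $\lbrace 2, 3, 10 \rbrace$ used above:

```python
def estimators(sample):
    """Compute N1, N2, N3 from a list of distinct labels (order does not matter)."""
    k = len(sample)
    m = max(sample)                 # sample maximum
    n1 = 2 * sum(sample) / k - 1    # N1: twice the sample mean, minus one
    n2 = m + (min(sample) - 1)      # N2: maximum plus the gap below the smallest label
    n3 = m + (m / k - 1)            # N3: maximum plus the average gap
    return n1, n2, n3

print(estimators([2, 3, 10]))       # (9.0, 11, 12.33...): note that N1 falls below m = 10
```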
Furthermore, $$
\begin{align}
{\Bbb E}[N_3]
&= {\Bbb E}\left[ m + \left( \frac{m}{k} - 1 \right) \right] \\[2mm]
&= {\Bbb E}\left[ m \left( 1+k^{-1} \right) - 1 \right] \\[2mm]
&= \left( 1+k^{-1} \right) {\Bbb E}\left[ m \right] - 1 \\[2mm]
&= \left( 1+k^{-1} \right) \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= \frac{k+1}{k} \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= N
\end{align}
$$ so $N_3$ is an unbiased estimator. The proof that $N_2$ is also unbiased follows a similar line of reasoning: a symmetric computation with the sample minimum gives ${\Bbb E}[x_1] = \frac{N+1}{k+1}$, so ${\Bbb E}[N_2] = \frac{k}{k+1}(N+1) + \frac{N+1}{k+1} - 1 = N$.
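Before comparing variances, here is a quick numerical sanity check of the unbiasedness claims: a minimal Python sketch (names and parameter values are illustrative) that averages each estimator over many random samples.

```python
import random

def mean_estimates(N, k, trials=200_000):
    """Average N1, N2, N3 over many random samples; unbiasedness predicts values near N."""
    s1 = s2 = s3 = 0.0
    for _ in range(trials):
        sample = random.sample(range(1, N + 1), k)
        m, lo, mean = max(sample), min(sample), sum(sample) / k
        s1 += 2 * mean - 1        # N1: twice the sample mean, minus one
        s2 += m + lo - 1          # N2: maximum plus the gap below the minimum
        s3 += m + m / k - 1       # N3: maximum plus the average gap
    return s1 / trials, s2 / trials, s3 / trials

print(mean_estimates(N=250, k=5))  # all three averages should be close to 250
```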
Variance comparison
Below is a comparison of the variances of the three unbiased estimators $N_1, N_2, N_3$ defined above. As I intend this post to be more practical than technical, I will avoid a lengthy computation involving binomial coefficients and simply state the variances below without proof.
Note: the Wikipedia article on the Discrete uniform distribution contains the full derivation of the variance formula for $N_3$; this derivation uses the formulas ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2} {\rm Var}[m]$ and ${\rm Var}[m] = {\Bbb E}[m^2] - ({\Bbb E}[m])^2$. The computation of ${\Bbb E}[m^2]$ is, of course, the lengthy part but is ultimately similar to our derivation of ${\Bbb E}[m]$ above.
The variance formulas are:
- ${\rm Var}[N_1] = \frac{(k+2)}{3k} \frac{(N-k)(N+1)}{(k+2)}$
- ${\rm Var}[N_2] = \frac{2}{(k+1)} \frac{(N-k)(N+1)}{(k+2)}$
- ${\rm Var}[N_3] = \frac{1}{k} \frac{(N-k)(N+1)}{(k+2)}$
Comparing the prefactors $\frac{k+2}{3k}$, $\frac{2}{k+1}$, and $\frac{1}{k}$, we can see that $N_3$ has the lowest variance of the three for all sample sizes $k \geq 2$. In fact, $N_3$ is the MVUE, or minimum variance unbiased estimator: it has the lowest variance of any unbiased estimator of $N$, not just of the three above.
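These formulas are straightforward to check by simulation. Below is a minimal Python sketch (function and variable names are illustrative) that compares simulated variances against the closed-form expressions for $N = 250$ and $k = 5$:

```python
import random
from statistics import pvariance

def estimate_variances(N, k, trials=200_000):
    """Simulated variances of N1, N2, N3, for comparison with the closed-form expressions."""
    n1, n2, n3 = [], [], []
    for _ in range(trials):
        sample = random.sample(range(1, N + 1), k)
        m, lo, mean = max(sample), min(sample), sum(sample) / k
        n1.append(2 * mean - 1)
        n2.append(m + lo - 1)
        n3.append(m + m / k - 1)
    return pvariance(n1), pvariance(n2), pvariance(n3)

N, k = 250, 5
common = (N - k) * (N + 1) / (k + 2)       # shared factor (N-k)(N+1)/(k+2)
print(estimate_variances(N, k))            # simulated variances
print((k + 2) / (3 * k) * common,          # theoretical Var[N1]
      2 / (k + 1) * common,                # theoretical Var[N2]
      1 / k * common)                      # theoretical Var[N3]
```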
In Part 3 of this post, I will introduce the mathematical heavy machinery (sorry for the pun) to prove that $N_3$ is indeed the MVUE.
Thanks for reading.
Sources
- Wikipedia: German Tank Problem
- Wikipedia: Discrete uniform distribution
- Estimating the Size of a Population (Roger Johnson) - a paper introducing the unbiased estimators above.
- An Empirical Approach to Economic Intelligence in World War II (Richard Ruggles and Henry Brodie, 1947) - the original paper describing the use of these statistical methods during WWII.