
Parameter Estimation - Part 3 (Minimum-Variance Unbiased Estimator)

Preliminaries: How do mathematicians model randomness?, Parameter Estimation - Part 1 (Reader Request), Parameter Estimation - Part 2 (German Tank Problem)

First off, a housekeeping note: apologies to all for the long delay between the last post and this one (over a year...oops!). Each post does take me a while behind the scenes, and work plus other non-math activities (mainly history stuff) partly deprived me of the time I needed to put this together. That being said, I will try to keep that gap shorter going forward. Now, on to the actual content.

In Part 2, we examined various unbiased estimators of the size of a population based on a random sample of serial numbers, concluding that the estimator $N_3 = m + (\frac{m}{k} - 1)$ (where $x_k = m$ is the maximum serial number observed in the sample, and $k$ is the sample size) had lower variance than the others. I also made the claim that this estimator has the lowest variance of any unbiased estimator of the population size. In this Part 3, I will introduce the concepts of completeness and sufficiency of an estimator, as well as the theorems that allow us to use these properties to find the minimum-variance unbiased estimator (and prove it is unique).


Statistics/parameters recap; some new notation


We saw in Parts 1 and 2 that we can use an observable statistic or estimator, associated with a sample, to estimate an unobservable parameter, associated with the entire population. For example, we used formulas based on the sample maximum, $x_k = m$, to estimate the population size/maximum, $N$, in the German Tank Problem. The values of the statistic have a probability distribution, called the sampling distribution of the statistic, as different samples of size $k$ will yield different values of the estimator.

If the mean (a.k.a. expected value) of the sampling distribution of a statistic equals the value of the parameter, then the statistic is unbiased; otherwise, it's biased. Furthermore, there is also a variance associated with the estimation. Now, there can be a trade-off between bias and variance (i.e. we could accept an estimator we know to be slightly biased for a significant reduction of variance vs. an unbiased alternative), a concept I did not delve into in the past posts. Among unbiased estimators though, we would prefer one with as little variance as possible. As the title suggests, we'll continue to focus only on unbiased estimators in this post.

In Part 2, we saw that the sampling distribution of the sample maximum was $$
\Bbb{P}[x_k = m] = f(m) = \frac{m-1 \choose k-1}{N \choose k}
$$ which obviously depends on the value of the parameter, $N$. The probability distribution would be a different one for a different value of the parameter. Thus, we have here a statistical model for randomly choosing a sample of $k$ serial numbers from a population of size $N$ (and from a discrete uniform distribution over $\lbrace 1, 2, \dots , N \rbrace$).
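As a quick sanity check of this sampling distribution, here is a short simulation sketch of my own (not part of the original derivation); the values of $N$, $k$, and the trial count are arbitrary illustrative choices.

```python
import random
from math import comb

N, k, trials = 20, 4, 200_000   # arbitrary illustrative values

# Monte Carlo: draw k distinct serial numbers from {1, ..., N} and record the max
counts = {}
for _ in range(trials):
    m = max(random.sample(range(1, N + 1), k))
    counts[m] = counts.get(m, 0) + 1

# Compare with the exact sampling distribution P[x_k = m] = C(m-1, k-1) / C(N, k)
for m in range(k, N + 1):
    exact = comb(m - 1, k - 1) / comb(N, k)
    print(f"m={m:2d}  simulated={counts.get(m, 0) / trials:.4f}  exact={exact:.4f}")
```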

The definitions in this post will require us to be a bit more rigorous than before, thinking of our statistical model as a family of probability distributions $\Bbb{P}_\theta$, parameterized, or indexed, by the value of a parameter $\theta$ (a common letter to use for a parameter in this context). In this post, I will use the notations $\Bbb{P}_\theta$ and $\Bbb{E}_\theta$ to denote the probability and expected value, respectively, for a particular value, $\theta$, of the relevant parameter within the model in question. I'll call the set of possible parameter values $\Theta$, and thus I'll refer to the model itself as $\scr{P} = \lbrace {\Bbb P}_\theta \, | \, \theta \in \Theta \rbrace$.

We'll also need the concept of conditional probability. The probability that $A$ will occur, given that $B$ has occurred, is denoted ${\Bbb P}[A|B]$ (read "probability of $A$, given $B$"). In general, the probability that $A$ and $B$ occur is ${\Bbb P}[A \cap B] = {\Bbb P}[A|B]{\Bbb P}[B] \, (\clubsuit)$. We can see that this equality is true by rearranging as ${\Bbb P}[A|B]={\Bbb P}[A \cap B] / {\Bbb P}[B]$: the probability of $A$, given $B$, is the probability of $A$'s intersection with $B$, divided by the probability of the entire $B$, i.e. the probability of the cases of interest divided by the total probability of all cases consistent with $B$.

Finally, we'll need the law of total expectation: ${\Bbb E}[{\Bbb E}[A | B]]={\Bbb E}[A]$. Note that ${\Bbb E}[A|B]$ is a function of $B$, $f(b)={\Bbb E}[A \, | \, B=b]$; the outer expectation then averages over the possible values $b$ of $B$.

That the law of total expectation is true, is illustrated by a simple example from the Wikipedia article (I will include the link in Sources below): if factory $X$ makes 60% of light bulbs at your hardware store, and $X$'s light bulbs last 5,000 hours on average, and factory $Y$ makes the other 40%, with $Y$'s bulbs lasting 4,000 hours on average, then the expected lifetime $L$ of a random bulb purchased would be: $$
{\Bbb E}[L] =
\underbrace{5{,}000 \times .6}_{{\Bbb E}[L|X] \times {\Bbb P}[X]}
+
\underbrace{4{,}000 \times .4}_{{\Bbb E}[L|Y] \times {\Bbb P}[Y]}
= 4{,}600
$$ The formal proof of the law of total expectation just involves manipulations of a double-sum, so I won't include it here.
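For readers who like to confirm such identities numerically, here is a minimal simulation of the light-bulb example (a sketch of my own; the 60/40 split and the 5,000/4,000-hour means come from the example above, while the exponential lifetime distribution and trial count are arbitrary choices made just for the simulation).

```python
import random

trials = 500_000
total = 0.0
for _ in range(trials):
    if random.random() < 0.6:            # factory X: 60% of bulbs, mean lifetime 5,000 hours
        total += random.expovariate(1 / 5000)
    else:                                # factory Y: 40% of bulbs, mean lifetime 4,000 hours
        total += random.expovariate(1 / 4000)

print(total / trials)   # should be close to E[L] = 0.6 * 5000 + 0.4 * 4000 = 4600
```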


Complete statistics


Let $X$ be a random variable within a parametric model $\scr{P}$, and let ${\mathbf x} = (x_1, x_2, \dotsc, x_k)$ be a sample of values of $X$. Let $T$ be a statistic, i.e. a function of $\mathbf x$ (thus, below, $T$ will be understood to mean $T({\mathbf x})$).

The statistic $T$ is called complete if, for any (measurable*) function $g$, $$
{\Bbb E}_\theta \left[ g(T) \right] = 0, \forall \theta \in \Theta
\implies
{\Bbb P}_\theta \left[ g(T)=0 \right] = 1, \forall \theta \in \Theta
$$ This is a bulky definition, so let's unpack it. The $\implies$ arrow means that if the statement on the left of the arrow is true, then the statement on the right of the arrow is also true.

The statement on the left is saying that, for every value $\theta$ of the parameter, the expected value of $g(T)$ (under the specific version of the model with $\theta$ as the parameter) is zero, for a (measurable) function $g$.

The statement on the right is saying that, for every value $\theta$ of the parameter, the probability (under the specific version of the model with $\theta$ as the parameter) that $g(T)=0$ is 1, which means that $g$ is the zero function almost everywhere. This means that the set $\lbrace {\bf x} \, | \, g(T({\bf x})) \neq 0 \rbrace$ has probability zero (though it is not necessarily the empty set, an important distinction). When this is the case, $g$ is, for probabilistic purposes, essentially the same as the zero function.

*Note: "measurable" has a specific meaning beyond the scope of this post, but this basically just serves to rule out ridiculous functions. For all intents and purposes, any function you can think of is measurable.

$g(T)$ in this definition should be thought of as an estimator (for some parameter) based on the statistic $T$. In the German Tank Problem example, $T$ could be the max function $T({\mathbf x})=\max (x_i) =m$, and $g(T)$ would be some function of that. If ${\Bbb E}_\theta \left[ g(T) \right] = 0$, we call $g(T)$ an unbiased estimator of zero.

So this definition of completeness is really saying that $T$ is a complete statistic if the only way to obtain an unbiased estimator of zero from $T$ is to apply an (almost-everywhere) zero function to $T$. Equivalently, applying any function that is not a.e. zero to a complete statistic $T$ can never produce an unbiased estimator of zero.

Note that whether a statistic is complete or not depends on the model: the "$\forall \theta \in \Theta$" part of the definition means that completeness may depend on which values of $\theta$ are, or are not, permitted in the model. For this reason, it is technically more correct to say that the family of probability distributions ${\scr F}_T = \lbrace f_T(t|\theta) \, | \, \theta \in \Theta \rbrace$ is complete, where the $f_T(t| \theta)$ are of course the probability distributions of $T$ for different values of the parameter. However, it is also common to simply say that "the statistic $T$ is complete," and I will use the simpler terminology for the remainder of this post. It is not too difficult to come up with examples in which placing a restriction on $\Theta$ changes the completeness status of a statistic; I'll provide links in Sources that include a few such examples, but in the interest of post length, I won't go into one here.

This rather technical definition of completeness serves an important purpose. As the following Proposition shows, the property of completeness, as defined, guarantees the uniqueness of unbiased estimators.

Proposition 1: If a statistic $T$ is complete, then any unbiased estimator of a parameter $\theta$, based on $T$ (i.e. any estimator $g(T)$ that is a function of $T$), is essentially unique.

More precisely, the term essentially unique means that, if there is another function $h(T)$ that is an unbiased estimator of $\theta$, then $h(T)=g(T)$ almost everywhere. In other words, the set of values of $T$, for which $h \neq g$, has probability zero: ${\Bbb P}\left[ \lbrace t=T({\bf x}) \, | \, h(t) \neq g(t) \rbrace \right]=0$.

Proof: Suppose $g_1(T)$ and $g_2(T)$ are two estimators based on $T$ that both have expected value $\theta$, i.e. both are unbiased estimators of the parameter $\theta$. Then the function $g(T)=g_1(T)-g_2(T)$ is an unbiased estimator of zero: $$
{\Bbb E}[g(T)] = {\Bbb E}[g_1(T)] - {\Bbb E}[g_2(T)] = \theta - \theta = 0
$$ Since $T$ is complete, the only unbiased estimator of zero based on $T$ is the (a.e.) zero function, so $g = g_1 - g_2$ must be zero almost everywhere, i.e. $g_1(T)=g_2(T)$ almost everywhere. Thus, the unbiased estimator of $\theta$ based on $T$ is essentially unique. $\square$


The sample max in the German Tank Problem is a complete statistic


Because $\Bbb{P}_\theta [x_k = m] = {m-1 \choose k-1}/{\theta \choose k}$ (here, we've replaced the population maximum $N$ with the letter $\theta$ in keeping with the notation from earlier in this post), we also know that, for a function $g$, and using $T$ as the sample maximum statistic: $$
{\Bbb E}_\theta[g(T)] = \sum_{j=k}^{\theta}{g(j) {\Bbb P}_\theta[T=j]} = \sum_{j=k}^{\theta}{g(j) \frac{j-1 \choose k-1}{\theta \choose k}}
$$ Suppose that the above expected value is $0$ for all permitted values of the parameter $\theta$, which, for a sample of size $k$, are $\lbrace k, k+1, k+2, \dotsc \rbrace$ (we need at least $k$ tanks in order to observe $k$ distinct serial numbers). Then, in particular, it is $0$ for the smallest permitted value, $\theta = k$. In that simple case, the sum has only one term, $j=k$, and since the fraction becomes ${k-1 \choose k-1} / {k \choose k}=1/1=1$, we must have $g(k)=0$ in order for the whole sum to equal zero.

The expectation must also be $0$ for the next case, $\theta=k+1$. Plugging in $g(k)=0$ from above, we are left with only one term, $j=k+1$, in the two-term sum. Since its fraction of binomial coefficients, ${k \choose k-1} / {k+1 \choose k}$, is again non-zero, we must have $g(k+1)=0$ by the same reasoning as above.

Continuing with this inductive argument, we obtain $g(j)=0$ for every value $j \geq k$, i.e. for every value that $T$ can take. In other words, $g$ must be the zero function (on the range of $T$) in order to have ${\Bbb E}_\theta[g(T)]=0, \forall \theta \in \Theta$. This proves that the sample maximum $T$ is a complete statistic with respect to the model ${\scr P}$.
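The induction above can also be phrased as a statement about a triangular linear system. The sketch below (an illustration of my own, truncating the parameter space at an arbitrary $\theta_{\max}$) builds the matrix whose $(\theta, j)$ entry is ${\Bbb P}_\theta[T=j]$; it is lower triangular with strictly positive diagonal entries, so the only solution of $Ag=0$, i.e. the only unbiased estimator of zero, is $g=0$.

```python
from math import comb

k, theta_max = 3, 10   # sample size and an arbitrary truncation of the parameter space
support = list(range(k, theta_max + 1))   # possible values of T (and of theta)

# A[i][j] = P_theta[T = m], with theta = support[i] and m = support[j]
A = [[comb(m - 1, k - 1) / comb(theta, k) if m <= theta else 0.0
      for m in support]
     for theta in support]

# The matrix is lower triangular with strictly positive diagonal entries,
# so A g = 0 forces g = 0 by back-substitution, exactly the induction in the text.
for i, theta in enumerate(support):
    print(theta, ["%.3f" % a for a in A[i]])
print("diagonal entries:", [round(A[i][i], 3) for i in range(len(support))])
```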


Sufficient statistics and the Factorization Theorem


Proposition 1 suggests that, if we can somehow use a complete statistic $T$ to derive a minimum-variance unbiased estimator, then that estimator will be unique. Sufficiency is the property we need in order to actually find an unbiased estimator with minimum variance.

An estimator has variance because different samples yield different values of the estimator. Furthermore, some information in the samples is relevant to the estimation of the parameter, while other information is not. For example, from Part 2, while common sense dictates that the sample maximum in the German Tank Problem should certainly be relevant in estimating the population size, the estimator $N_1=2{\bar x}-1$ is based on the sample mean. Similarly, estimator $N_2=m+(x_1-1)$, though doing better since it's based on $m$, includes a "gap" adjustment based on the lowest serial number in the sample. These facts about the sample are "noise": they don't help us estimate $N$ more accurately.

Indeed, for both $N_1$ and $N_2$, different samples, all of size $k$ and with the same sample maximum $m$, can yield different estimator values. $N_3 = m + (\frac{m}{k}-1)$, on the other hand, depends only on the sample max $m$ and sample size $k$ (recall from Part 2 that the "average gap" adjustment simplified down to depend only on these). While different samples will have different values of $m$, yielding some estimator variance that is unavoidable, if we are given a certain value of $m$, we can have only one value of the estimator $N_3$. Thus, $N_3$ has less "avoidable noise" than the other estimators and should be the least volatile.

The discussion above is the intuition behind the following definition: a statistic $T({\bf x})$ is called sufficient (for the parameter $\theta$) if the conditional probability ${\Bbb P}_\theta[X={\bf x} \, | \, T({\bf x})=t]$ does not depend on $\theta$ (it may, of course, still depend on the sample ${\bf x}$ and on the value $t$ of the statistic).

If $T$ is a sufficient statistic, then the probability that the sample takes on a particular value ${\bf x}$ depends on $\theta$ only through the value $t$ of the statistic. Flipping this around, when we estimate the value of the unknown parameter $\theta$ based on the sample ${\bf x}$, then the value $t=T({\bf x})$ of the statistic already contains all the information from the sample that is relevant to estimating $\theta$. For this reason, a sufficient statistic $T$ is sometimes referred to as a lossless compression of the sample ${\bf x}$.

The following theorem characterizes sufficient statistics.

Fisher-Neyman Factorization Theorem: Let $f_X({\bf x}|\theta)$ be the probability distribution of $X$, given the parameter value $\theta$. A statistic $T(X)$ is sufficient for the parameter $\theta$ if and only if there exist functions $g$ and $h$ such that $$
f_X({\bf x}|\theta) = h({\bf x})g(\theta, T({\bf x}))
$$ In other words, the probability density function of $X$ can be factored into a product such that one factor, $h$, does not depend on $\theta$, and the other factor, which does depend on $\theta$, depends on the sample ${\bf x}$ only through the statistic $t=T({\bf x})$.

This theorem makes some intuitive sense given the discussion above; a thorough reading of the proof helped me understand the symbols in terms of this intuition (and vice-versa). Note that this proof only works for the case of discrete random variables, but the theorem also applies in the case of continuous r.v.'s.

Proof:
Sufficiency $\implies$ factorization:
First, note the trivial fact that, since $T$ is a function of $X$, ($T=T({\bf x})$ and $X={\bf x}$) if and only if $X={\bf x}$. This means that ${\Bbb P}_\theta [X={\bf x}]={\Bbb P}_\theta [X={\bf x} \cap T=T({\bf x})]$, or in terms of the probability mass functions, $f_X({\bf x}|\theta)=f_{X,T(X)}({\bf x},T({\bf x})|\theta)$. Therefore: $$
\begin{align}

f_X({\bf x}|\theta)

&= f_{X,T(X)}({\bf x},T({\bf x})|\theta) \\[2mm]

&\stackrel{(\clubsuit)}{=}
\underbrace{f_{X|T(X)}({\bf x}|T({\bf x}),\theta) }_{h({\bf x})}
\cdot
\underbrace{f_{T({\bf x})}(T({\bf x})|\theta)}_{g(\theta,T({\bf x}))}

\end{align}
$$ The second equality is just equation $(\clubsuit)$ from the intro, written in terms of the pmf's rather than the probabilities. Since $T$ is sufficient, the conditional pmf $f_{X|T(X)}({\bf x}|T({\bf x}))$ does not depend on $\theta$; it is a function of ${\bf x}$ alone (in part through the value $t=T({\bf x})$), so we can write it as $h({\bf x})$. The second pmf, $f_{T({\bf x})}(T({\bf x})|\theta)$, depends only on $t=T({\bf x})$ and $\theta$, so we can write it as $g(\theta,T({\bf x}))$, yielding the desired factorization formula.

Factorization $\implies$ sufficiency:
We turn again to the pmf version of $(\clubsuit)$: $$
f_{X|T(X)}({\bf x}|t,\theta)=\frac{f_{X,T(X)}({\bf x},t|\theta)}{f_{T(X)}(t|\theta)} \, \, \, \, (1)
$$ The numerator on the right-hand side is equal to $f_X({\bf x}|\theta)$ if $t=T({\bf x})$, and zero if $t \neq T({\bf x})$, again since $T$ is a function of $X$. Assuming the factorization formula, $f_X({\bf x}|\theta) \stackrel{(2)}{=} h({\bf x})g(\theta,t)$ when $t=T({\bf x})$.

For the denominator, since $T$ is a function of $X$, we have ${\Bbb P}_\theta[T=t]=\sum{\Bbb P}_\theta[X={\bf \tilde x}]$, where the sum is taken over the values of the dummy variable ${\bf \tilde x}$ for which $T({\bf \tilde x})=t$. Thus: $$
f_{T(X)}(t|\theta)
= \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{f_X({\bf \tilde x}|\theta)}
\stackrel{(3)}{=} \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}
$$ where the first equality is the pmf version of the probability statement above, and the second equality follows from the assumption of the factorization formula for $f_X({\bf x}|\theta)$.

Plugging $(2)$ and $(3)$ into $(1)$ yields: $$
\begin{align}
f_{X|T(X)}({\bf x}|t,\theta)

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}} \\[2mm]

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]g(\theta,t)}}

=\frac
{h({\bf x})}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]}}
\end{align}
$$ where we took the $g(\theta,t)$ out of the sum in the second equality since it doesn't depend on ${\bf \tilde x}$. Since the final expression does not depend on $\theta$, we have proved that $T$ is sufficient.
$\square$


The sample max in the German Tank Problem is a sufficient statistic


To show that the sample max is a sufficient statistic, we can use the Factorization Theorem above. Let ${\bf 1}_A$ be the function that equals $1$ if $A$ is true and $0$ otherwise (this is called the indicator function of event $A$), and, for a sample ${\bf x}=(x_1,x_2,\dotsc,x_k)$, let the statistic $T({\bf x})=\max(x_i)$, the sample max.

The sample consists of $k$ distinct serial numbers drawn without replacement, and each of the ${\theta \choose k}$ possible samples is equally likely. The probability mass function for the (ordered) sample $X$ taking the value ${\bf x}=(x_1 < x_2 < \dotsb < x_k)$, given the population size is $\theta$, is therefore: $$
\begin{align}
f_X({\bf x} \, | \, \theta)

&= \frac{1}{\theta \choose k} {\bf 1}_{\lbrace 1 \leq x_1 < x_2 < \dotsb < x_k \leq \theta \rbrace} \\[2mm]

&= \underbrace{
\vphantom{\frac{1}{\theta \choose k}}
{\bf 1}_{\lbrace 1 \leq x_1 < x_2 < \dotsb < x_k \rbrace}
}_{h({\bf x})}
\,
\underbrace{\frac{1}{\theta \choose k} {\bf 1}_{\lbrace T({\bf x}) \leq \theta \rbrace}}_{g(\theta,T({\bf x}))}

\end{align}
$$ where, in the second line, the condition $x_k \leq \theta$ has been rewritten as $T({\bf x}) = \max(x_i) \leq \theta$. Thus, by the Factorization Theorem, the sample max $T(X)$ is a sufficient statistic for the population size $\theta$.
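To make sufficiency concrete, here is a small enumeration of my own (with arbitrary small values of $k$, $\theta$, and the conditioning value $m$): for two different population sizes, the conditional probability of any particular sample, given the sample max, comes out the same, namely $1/{m-1 \choose k-1}$, with no dependence on $\theta$.

```python
from itertools import combinations
from math import comb

k = 3
for theta in (8, 12):                       # two different population sizes
    samples = list(combinations(range(1, theta + 1), k))   # all equally likely samples
    m = 6                                   # condition on a particular sample max
    with_max = [s for s in samples if max(s) == m]
    # conditional probability of any one sample, given T = m
    cond_prob = (1 / len(samples)) / (len(with_max) / len(samples))
    print(f"theta={theta}: P[X=x | T={m}] = {cond_prob:.4f}  "
          f"(= 1/C({m-1},{k-1}) = {1 / comb(m - 1, k - 1):.4f})")
```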


The Rao-Blackwell and Lehmann-Scheffé Theorems


The discussion above suggests that completeness and sufficiency should help us obtain a unique estimator with minimal variance. The Rao-Blackwell and Lehmann-Scheffé Theorems prove this definitively. The first of these two gives us a process to reduce the variance of an estimator:

Clockwise from top-left: Rao, Blackwell, Scheffé, Lehmann
All four images courtesy of Wikipedia.

Rao-Blackwell Theorem:
Let $U=u(X)$ be an unbiased estimator of $\theta$, and let $T=t(X)$ be a sufficient statistic for $\theta$. Then the Rao-Blackwell estimator $$
U^*=u^*({\bf x}) = {\Bbb E}_{\theta}[U \, | \, T=t({\bf x})]
is also unbiased, and its variance is no greater than that of $U$.

In the proof below, all ${\Bbb E}$'s are understood to stand for ${\Bbb E}_{\theta}$.

Proof: By the definition of sufficiency given above, $u^*({\bf x})$ is indeed a function of ${\bf x}$, since the conditional expectation of $U$, given the sufficient statistic $T$, does not depend on $\theta$. Also, $U^*$ is unbiased by the law of total expectation and the fact that $U$ is unbiased: $$
{\Bbb E}[U^*] = {\Bbb E}[{\Bbb E}[U|T]] = {\Bbb E}[U] = \theta
$$ Furthermore, the variance of $U^*$ is at most that of $U$: $$
\begin{align}
{\rm var}[U^*]

&= {\Bbb E}\left[ \left( U^* - {\Bbb E}[U^*] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right]  \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U|T] - \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U-  \theta \, | \, T] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{3}\\[2mm]


&\leq {\Bbb E}\left[ {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \, \middle| \, T \right] \right] \tag{4} \\[2mm]

&= {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{5} \\[2mm]

&= {\Bbb E}\left[ \left( U-  {\Bbb E}[U] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\rm var}[U]

\end{align}
$$ The third equality is true because $\theta$ is a constant (and ${\Bbb E}$ is linear), the inequality in line 4 is due to Jensen's inequality, and the equality in line 5 is the law of total expectation. $\square$
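Before moving on, here is a small numerical illustration of the theorem we just proved (a sketch of my own, not part of any source's argument): Rao-Blackwellizing the unbiased estimator $N_2 = m + (x_1 - 1)$ from Part 2 by conditioning on the sample max turns it, sample by sample, into $N_3 = m + (\frac{m}{k} - 1)$, with a corresponding drop in variance. The population size $N=15$ and sample size $k=3$ are arbitrary small values chosen so that a full enumeration is fast.

```python
from itertools import combinations
from statistics import mean, pvariance

N, k = 15, 3                       # small illustrative population and sample size
samples = list(combinations(range(1, N + 1), k))   # all equally likely samples

n2 = [max(s) + min(s) - 1 for s in samples]        # N_2 = m + (x_1 - 1)

# Rao-Blackwellize: replace N_2 by its average over all samples sharing the same max
by_max = {}
for s, v in zip(samples, n2):
    by_max.setdefault(max(s), []).append(v)
rb = [mean(by_max[max(s)]) for s in samples]       # E[N_2 | T = max]

n3 = [max(s) + max(s) / k - 1 for s in samples]    # N_3 = m + (m/k - 1)

print("E[N2|T] equals N3 on every sample:", all(abs(a - b) < 1e-9 for a, b in zip(rb, n3)))
print("mean N2 =", mean(n2), "  mean of RB version =", mean(rb))   # both average to N = 15
print("var  N2 =", round(pvariance(n2), 2), "  var of RB version =", round(pvariance(rb), 2))
```

The fact that the Rao-Blackwellized $N_2$ coincides exactly with $N_3$ foreshadows the Lehmann-Scheffé result below: once we condition on the complete, sufficient statistic $T$, there is only one unbiased estimator we can land on.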

Rao-Blackwell is telling us that if we supply an unbiased estimator, we can condition on a sufficient statistic to obtain a new unbiased estimator with lower variance. This sounds like a good process to source MVUE candidates. The Lehmann-Scheffé Theorem is the last piece of the puzzle, and we need one more Proposition to prove it.

Proposition 2: If an MVUE exists, it is essentially unique. That is, if $U$ and $U'$ are two MVUE's, then $U=U'$ almost everywhere.

Proof: For random variables $X$ and $Y$, and constants $c, c_1, c_2$, we have the following variance and covariance formulas (which I won't derive in detail here): $$
\begin{align}
{\rm var}[X+Y] &= {\rm var}[X]+{\rm var}[Y] + 2{\rm cov}(X,Y) \tag{1}\\[2mm]
{\rm var}[cY] &= c^2 {\rm var}[Y] \tag{2} \\[2mm]
{\rm cov}(c_1 X,c_2 Y) &= c_1 c_2 {\rm cov}(X,Y) \tag{3}\\
\end{align}
$$ So, let $U$ be an MVUE and $U'$ another MVUE. Since $U$ and $U'$ both have expected value $\theta$, their difference has expected value zero. Thus, if we can show that ${\rm var}[U'-U]=0$, that would mean that $U'-U$ is constant almost everywhere; since its expected value is zero, that constant is zero, and thus $U'=U$ almost everywhere. Using the formulas above, for any constant $c$, we have: $$
{\rm var}[U+c(U'-U)] = {\rm var}[U] + c^2 {\rm var}[U'-U] + 2c \, {\rm cov}(U, U'-U) \, \,(\spadesuit)
$$ Assume ${\rm var}[U'-U] = \gamma >0$ (we will show this leads to a contradiction), and let $\rho = -{\rm cov}(U,U'-U)$. Equation $(\spadesuit)$ must be true for any value of $c$, so let's plug in $c=\rho / \gamma$: $$
\begin{align}
{\rm var}[U+c(U'-U)]

&= {\rm var}[U] + \left( \frac{\rho}{\gamma} \right)^2 \gamma -2\left( \frac{\rho}{\gamma} \right) \rho \\[2mm]

&= {\rm var}[U] - \rho^2 / \gamma
\end{align}
$$ Since $U$ is an MVUE, the variance of $U+c(U'-U)$ can't be less than that of $U$. Thus, since $\rho^2 / \gamma \geq 0$, we must have $\rho^2 / \gamma = 0 \implies \rho = -{\rm cov}(U,U'-U)=0$.

Plugging this into $(\spadesuit)$, and now using the value $c=1$, we obtain: $$
\begin{align}
{\rm var}[U+1 \cdot (U'-U)] &= {\rm var}[U] + 1^2 \cdot {\rm var}[U'-U] + 0 \\[2mm]
{\rm var}[U'] &= {\rm var}[U] + \gamma
\end{align}
$$ But $U'$ is an MVUE, so its variance cannot be greater than that of $U$, which forces $\gamma = 0$ and contradicts our assumption that $\gamma > 0$. We must therefore have ${\rm var}[U'-U]=0$, and thus $U'=U$ almost everywhere. $\square$

The proof of the following theorem will now be nice and easy, based on the discussion above.

Lehmann-Scheffé Theorem: If $T$ is a complete and sufficient statistic for a parameter $\theta$, and $U$ is an unbiased estimator of $\theta$, then $U^*={\Bbb E}[U|T]$ is the unique MVUE for $\theta$.

Proof: Firstly, let's break down the notation. Recall from the intro section that $U^*={\Bbb E}[U|T]$ is indeed a function of the value of $T$. Indeed, if ${\bf x}$ is the value of our sample, then the statistic $T$ has the value $t=T({\bf x})$, and $U^*({\bf x}) = g(t) = {\Bbb E}[U \, | \, T({\bf x})=t]$. Recall that we can write this in the simpler form $U^*=g(T)$, with the dependence on the sample ${\bf x}$ implied.

By Rao-Blackwell, we know that (1) $U^*$ is unbiased (since $U$ is unbiased, and $T$ is sufficient), and (2) ${\rm var}[U^*] \leq {\rm var}[U]$. Thus, $U^*$ is an MVUE candidate. Now, let $V$ be another MVUE candidate. We will prove that $V=U^*$ almost everywhere.

Define $V^* = {\Bbb E}[V|T] := h(T)$. Also by Rao-Blackwell, $V^*$ is unbiased and has variance less than or equal to that of $V$. But since $V$ is an MVUE, ${\rm var}[V^*] \nless {\rm var}[V]$, so we must have ${\rm var}[V^*]={\rm var}[V]$, and $V^*$ is also an MVUE. By Proposition 2, the MVUE is essentially unique, so $V^*=V$ almost everywhere.

Furthermore, by Proposition 1, any unbiased estimator that is a function of $T$ is essentially unique. Thus, the following is true (almost everywhere): $$
V = V^* = h(T) = g(T) = U^*
$$ Thus, $U^*$ is the unique MVUE of $\theta$. $\square$

All of the above theorems and propositions allow us to conclude that all of the following are true:

  • If we can find an unbiased estimator $U$ of $\theta$ as well as a complete and sufficient statistic $T$, then the Rao-Blackwell estimator $U^* = {\Bbb E}[U|T]$ is the MVUE of $\theta$ (Lehmann-Scheffé). The MVUE is a function of $T$.
  • This function of $T$ is essentially unique (Proposition 1).
  • An MVUE is essentially unique (Proposition 2). This means that, regardless of which complete and sufficient statistic $T$ we use to Rao-Blackwellize, we will obtain the same MVUE.
  • Similarly, we will obtain the same MVUE regardless of which unbiased estimator $U$ we start with.
The various sources listed below all come to these conclusions, often in different orders or stating some a bit differently. In Sources below, I include some commentary on the approaches of the respective authors.

Finally, turning back to the unbiased estimator $N_3$ from the German Tank Problem, it is now easy to see why it is the MVUE. We proved above that $T({\bf x})=\max(x_i)=m$ is complete and sufficient. As $N_3$ only depends on $m$, conditioning on $T$ obviously adds no additional information, so $N_3^* = {\Bbb E}[N_3 \, | \, T = m] = N_3$. By the Lehmann-Scheffé Theorem, $N_3$ is thus the unique MVUE for the population size $\theta$. An even simpler explanation is that $N_3$ is an unbiased function of the complete, sufficient statistic $T$. Since the MVUE is the unique unbiased function of $T$, $N_3$ is the MVUE.

This post was a bit more on the technical side, but I do hope folks enjoyed seeing the rigorous logic needed to prove that $N_3$ is the MVUE, as well as the heuristic observations / intuition relating to each step. I hope to have more content for you soon, and as always, I welcome any reader requests for topics. Until then, thanks for reading.


Sources


Wikipedia:
Law of Total Expectation
Rao-Blackwell Theorem
Lehmann-Scheffé Theorem
Note: I believe the proof of the Lehmann-Scheffé Theorem in the Wikipedia article is missing one step: namely, it proves that the candidate MVUE is the unique function of the complete-sufficient statistic that could be the MVUE, but it never proves that the MVUE must indeed be a function of the complete-sufficient statistic. The other sources below, as well as the proof I included above, remedy this omission. If I'm missing something in the Wikipedia proof, I'd appreciate it if someone would let me know in the comments.

FSU lecture notes (Debdeep Pati):
Complete Statistics
Note: These lecture notes contain numerous examples in which changing the parameter space $\Theta$ changes a statistic's completeness status. I promised above that I would include a source with such examples, since I did not include any in the post itself.

Stack Exchange:
A thread on the intuition behind the definition of completeness
Note: this thread also contains an example of a change in the completeness status of a statistic based on a change in $\Theta$.

OU lecture notes (Christian Remling):
Sufficiency
Completeness and Sufficiency
These lecture notes offer a thorough treatment of the topics with numerous examples other than just the German Tank Problem (e.g. the coin flip model). The notes also contain the continuous analog of the GTP, in which proving completeness and sufficiency of the sample max requires calculus.

Arizona lecture notes (Joe Watkins):
Sufficient Statistics
Another thorough treatment; I found Watkins's notation for the pmf's the easiest to follow and thus adopted that notation throughout this post. I also followed his proof of the Fisher-Neyman Factorization Theorem, with a bit more explanation added in for my readers.

Oxford lecture notes (Steffen Lauritzen):
Sufficiency and Unbiased Estimation
In reaching the results of the Lehmann-Scheffé Theorem, these notes squeeze as much as possible out of sufficiency before finally invoking completeness. I used Lauritzen's proofs of (1) the Rao-Blackwell Theorem, (2) the essential uniqueness of the MVUE, and (3) the fact that the MVUE must be a function of a sufficient statistic (the latter being rolled up into my proof of the Lehmann-Scheffé Theorem, which is the same way Remling handled it).

Parameter Estimation - Part 2 (German Tank Problem)

Preliminaries: Parameter Estimation - Part 1 (Reader Request)

Part 1 of this post introduced the concept of calculating statistics based on sample data and using them to estimate the unknown values of population parameters. As an estimator's predictive quality depends on both its bias and its variance, unbiasedness and low variance are clearly desirable qualities. In this post, we will derive and compare various estimators in the context of a concrete example, the German Tank Problem.


Historical background


During WWII, the Economic Warfare Division of the US Embassy in London utilized a statistical method to estimate the monthly production rate of German tanks (among other equipment). While standard intelligence estimates (based, for example, on widely divergent POW reports or projections based on pre-war production capacity) amounted to 1,400 tanks per month between June 1940 and September 1942, the statistical analysis yielded a far lower figure: 256 per month. After the war, captured government records revealed the actual figure to be 255.

This statistical analysis was based on serial number markings on parts of captured or destroyed tanks. For example, the serial numbers on various wheels yielded an estimate of the total number of wheel molds in use, and British road-wheel manufacturers provided data on the number of wheels that could be produced from a single mold. Thus, if there are 32 wheels per tank, the estimated number of molds in use, times the number of wheels produced per mold per month, divided by 32, would be an estimate for the number of tanks produced in a given month.

Analysis of the 64 wheels from two captured tanks produced an estimate of 270 tanks produced in February 1944, while government records indicated 276 tanks for that month. The statistical method was thus quite effective, even based on a relatively small number of captured tanks. The Allied statisticians also analyzed wheels, gearboxes, chassis, and engines in order to cross-check their results.

Ignoring the particulars of wheel molds and serial numbering conventions, the mathematical statement of the estimation problem now known as the German Tank Problem is quite simple: suppose we have a population of $N$ items with labels $1,2, \dotsc, N$. Given a random sample of $k$ items with labels $x_1, x_2, \dotsc, x_k$, how do we estimate the population size $N$?

We would like an estimator for $N$ which is unbiased and has low variance. Without loss of generality, we can assume that the labels are in order, i.e. $x_1 < x_2 < \dotsb < x_k$. Furthermore, we define the sample maximum $m := x_k$.


Sample maximum: a crude estimator


One obvious candidate for an estimator of $N$ is simply $m$. However, common sense dictates that this estimate must be biased downward, as $m$ cannot exceed $N$. Thus, the expected value of $m$ should be less than or equal to $N$.

Indeed, the probability of obtaining a sample with maximum $m$ is the number of ways to choose a sample with maximum $m$ divided by the number of possible samples. To choose a sample with maximum $m$, we must choose $m$ (there is only one way to do this) as well as $k-1$ additional sample values from the set $\lbrace 1, 2, \dotsc, m-1 \rbrace$ in one of ${m-1} \choose {k-1}$ ways, so the probability is $$
{\Bbb P}\left[ x_k = m \right] = \frac{{m-1}\choose{k-1}}{{N}\choose{k}}
$$ and the expected value of $m$ is: $$
\begin{align}
{\Bbb E}[m]
&= \sum_{j=k}^{N}{j \frac{{j-1}\choose{k-1}}{{N}\choose{k}}} \\[2mm]
&= \frac{1}{N \choose k} \sum_{j=k}^{N}{j \frac{(j-1)!}{((j-1)-(k-1))! (k-1)!}} \\[2mm]
&= \frac{1}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{\frac{j!}{(j-k)!}} \\[2mm]
&= \frac{k!}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{j \choose k} \\[2mm]
&\buildrel{(*)} \over{=} k \frac{{N+1}\choose{k+1}}{N \choose k} \\[2mm]
&= \frac{k}{k+1}(N+1)
\end{align}
$$ which does not equal $N$, except when $k=N$. Thus, $m$ is a biased estimator of $N$. The bias does become small when the sample size is a significant proportion of $N$, but this is not usually the case in practice.

$(*)$ Note: this equality follows from the Hockey-stick Identity for binomial coefficients: $\sum_{j=k}^{N}{j \choose k} = {{N+1} \choose {k+1}}$

Expected value of $m$ for various values of sample size $k$ (with $N=250$).
Notice that the bias is sizable until $k$ is ~20-30, i.e. 8-12% of $N$.
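If you would like to confirm the formula for ${\Bbb E}[m]$ without re-doing the algebra, here is a short exact computation of my own, using $N=250$ as in the figure above and a few arbitrary sample sizes.

```python
from math import comb

N = 250
for k in (2, 5, 10, 20, 30):
    # E[m] computed directly from the sampling distribution of the max
    exact_sum = sum(m * comb(m - 1, k - 1) for m in range(k, N + 1)) / comb(N, k)
    formula = k * (N + 1) / (k + 1)
    print(f"k={k:2d}  E[m] by summation = {exact_sum:.4f}  k(N+1)/(k+1) = {formula:.4f}")
```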


A few unbiased estimators


We can employ a few common sense observations to derive unbiased estimators of $N$.

Firstly, let $\psi$ be the middle value of the list $1, 2, \dotsc, N$, i.e. $\psi = \frac{1}{2}(N+1)$ (this formula works whether $N$ is odd or even). Then $N=2\psi-1$, and the sample mean $\overline{x} = \frac{1}{k}\sum_{i=1}^{k}{x_i}$ is an unbiased estimator of $\psi$ (each $x_i$ is, marginally, uniformly distributed on $\lbrace 1, 2, \dotsc, N \rbrace$, so ${\Bbb E}[\overline{x}]={\Bbb E}[x_i]=\frac{1}{2}(N+1)=\psi$ by linearity of expectation). This, in turn, implies that $N_1 := 2\overline{x}-1$ is an unbiased estimator of $N$.

The issue with $N_1$ is that it can be less than the sample maximum $m$. For example, if we have a sample of size $k=3$ with $x_1 = 2, x_2 = 3, x_3=10$, then $\overline{x}=5$, and $N_1=9$, but $m=10$. This is an undesirable estimator as we know for sure that $N \geq 10$.

A more sophisticated method for deriving estimates revolves around analysis of the expected "gaps" between observations in the sample. For example, we may expect that, on average, the sample observations $x_1, x_2, \dotsc, x_k$ will be distributed roughly evenly throughout the interval $1, 2, \dotsc, N$. Therefore, we expect the number of unobserved labels below $x_1$ (which is observable from our sample, as the lowest label is $1$) to be equal, on average, to the number of unobserved labels above $m=x_k$ (which is not observable, as $N$ is unknown). Thus, we define our second estimator to be $N_2 :=  m + (x_1 - 1)$.

We can further refine $N_2$ by replacing the "gap" $(x_1 - 1)$ with the average of all the gaps in our sample, thus more fully utilizing the sample data. The average gap size in the sample is $$
\begin{align}
G_{\text{avg}}

&= \frac{1}{k} \left[ (x_1 - 1) + (x_2 - x_1 - 1)
+ \dotsb + (x_k - x_{k-1} - 1) \right] \\[2mm]

&= \frac{1}{k} \left[ x_k - k \right] \\[2mm]

&= \frac{m}{k}-1
\end{align}
$$ Consequently, we define our third estimator to be $N_3 := m + G_{\text{avg}} = m + \left( \frac{m}{k} - 1\right)$. Clearly, $N_2, N_3 \geq m$, so these estimators cannot exhibit the undesirable quality of taking values less than $m$.

Furthermore, $$
\begin{align}
{\Bbb E}[N_3]
&= {\Bbb E}\left[ m + \left( \frac{m}{k} - 1 \right) \right] \\[2mm]
&= {\Bbb E}\left[ m \left( 1+k^{-1} \right) - 1 \right] \\[2mm]
&= \left( 1+k^{-1} \right) {\Bbb E}\left[ m \right] - 1 \\[2mm]
&= \left( 1+k^{-1} \right) \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= \frac{k+1}{k} \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= N
\end{align}
$$ so $N_3$ is an unbiased estimator. The proof that $N_2$ is also unbiased follows a similar line of reasoning.
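As a quick empirical check (a simulation sketch of my own; the values of $N$, $k$, and the trial count are arbitrary), the averages of $N_1$, $N_2$, and $N_3$ over many random samples all land close to the true $N$.

```python
import random

N, k, trials = 250, 5, 100_000
sums = [0.0, 0.0, 0.0]
for _ in range(trials):
    x = sorted(random.sample(range(1, N + 1), k))
    m, lo, xbar = x[-1], x[0], sum(x) / k
    n1 = 2 * xbar - 1               # based on the sample mean
    n2 = m + (lo - 1)               # max plus the gap below the minimum
    n3 = m + (m / k - 1)            # max plus the average gap
    for i, v in enumerate((n1, n2, n3)):
        sums[i] += v

print([round(s / trials, 2) for s in sums])   # all three should be close to N = 250
```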


Variance comparison


Below is a comparison of the variances of the three unbiased estimators $N_1, N_2, N_3$ defined above. As I intend this post to be more practical than technical, I will avoid a lengthy computation involving binomial coefficients and simply state the variances below without proof.

Note: the Wikipedia article on the Discrete uniform distribution contains the full derivation of the variance formula for $N_3$; this derivation uses the formulas ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2} {\rm Var}[m]$ and ${\rm Var}[m] = {\Bbb E}[m^2] - ({\Bbb E}[m])^2$. The computation of ${\Bbb E}[m^2]$ is, of course, the lengthy part but is ultimately similar to our derivation of ${\Bbb E}[m]$ above.

The variance formulas are:

  • ${\rm Var}[N_1] = \frac{(k+2)}{3k} \frac{(N-k)(N+1)}{(k+2)}$
  • ${\rm Var}[N_2] = \frac{2}{(k+1)} \frac{(N-k)(N+1)}{(k+2)}$
  • ${\rm Var}[N_3] =  \frac{1}{k} \frac{(N-k)(N+1)}{(k+2)}$
while the standard deviations, which have more recognizable scales (i.e. number of tanks instead of squared number of tanks), are the square roots of the respective variances. Let's see how the estimators perform for various sample sizes:
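Here is a small tabulation of my own based on the formulas above (I use $N=250$ to match the earlier figure; the post's own plot may use different parameters).

```python
N = 250   # chosen to match the earlier E[m] figure; the original plot's N may differ

def sd1(N, k):   # std dev of N1 = 2*xbar - 1
    return ((k + 2) / (3 * k) * (N - k) * (N + 1) / (k + 2)) ** 0.5

def sd2(N, k):   # std dev of N2 = m + (x1 - 1)
    return (2 / (k + 1) * (N - k) * (N + 1) / (k + 2)) ** 0.5

def sd3(N, k):   # std dev of N3 = m + (m/k - 1), the MVUE
    return (1 / k * (N - k) * (N + 1) / (k + 2)) ** 0.5

print(" k   sd(N1)   sd(N2)   sd(N3)")
for k in (2, 5, 10, 20, 50):
    print(f"{k:2d}  {sd1(N, k):7.2f}  {sd2(N, k):7.2f}  {sd3(N, k):7.2f}")
```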



We can see that $N_3$ has the lowest variance of the three for all sample sizes. In the legend, I have marked $N_3$ as "MVUE", which stands for minimum variance unbiased estimator. This means that $N_3$ actually has the lowest variance of any unbiased estimator for $N$, not just the three above.

In Part 3 of this post, I will introduce the mathematical heavy machinery (sorry for the pun) to prove that $N_3$ is indeed the MVUE.

Thanks for reading.


Sources


Wikipedia: German Tank Problem
Wikipedia: Discrete uniform distribution
Estimating the Size of a Population (Roger Johnson) - a paper introducing the unbiased estimators above.
An Empirical Approach to Economic Intelligence in World War II (Richard Ruggles and Henry Brodie, 1947) - the original paper describing the use of these statistical methods during WWII.

Parameter Estimation - Part 1 (Reader Request)

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 2, Proof of the Law of Large Numbers

The preliminary post How do mathematicians model randomness? introduced random variables, their probability distributions, and parameters thereof (namely, mean, variance, and standard deviation). This post, the response to a reader request from Anonymous, will cover estimation of parameters based on random sampling. I will explain the difference between parameters and statistics, introduce the concept of estimator bias, and address the reader request's specific question about unbiased estimation of standard deviation.

In Part 2 of this post, I will present a well known historical application of parameter estimation, the German Tank Problem, and compare methods of estimating an unknown population size. Finally, in Part 3, I will introduce complete and sufficient statistics, which allow us to prove that the best estimator among the candidates in Part 2 is the unique minimum variance unbiased estimator.


Parameters and statistics


Recall from the first preliminary post that we model random phenomena with random variables and their probability distributions. Key characteristics of these distributions (mean, variance, standard deviation, etc.) are called parameters.

Parameters are defined based on all possible values of a random variable (the population) weighted by their relative frequencies of occurrence (i.e. their probabilities). For example, the mean (denoted $\mu_{X}$, ${\Bbb E}(X)$, or simply $\mu$ when there is no ambiguity as to the underlying random variable) of an outcome of interest (i.e. random variable) $X$ is defined as $$
\mu = \sum_{i}{x_i {\Bbb P}(x_i)}
$$ where each $x_i$ is one of the possible values of $X$ and the sum runs over all possible values (in the continuous case, the sum would be replaced by an integral).

In practice, we do not have access to all possible values of $X$ and their probabilities. Instead, we typically have a sample of observed values, $x_1, x_2, \dotsc, x_n$. Given such a sample, we could estimate $\mu$ using the sample mean $$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}
$$ The sample mean is a statistic, a value calculated based on sample data, which can be used to estimate the (unknown) parameter value. Notice that $\bar{x}$ can take on different values depending on which random sample we used to calculate it. In other words, $\bar{x}$ is itself a random variable.


Estimator bias, Bessel's correction


As random variables, statistics have their own probability distributions, known as sampling distributions, and thus their own means, variances, standard deviations, etc. We actually already touched upon this fact in the earlier posts Monte Carlo Simulation - Part 2 and Proof of the Law of Large Numbers, in which we proved that the sample mean $\bar{x}$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$ (a fact that we will use below).

Since ${\Bbb E}(\bar{x}) = \mu$, we say that the sample mean is an unbiased estimator of the population mean $\mu$. By the same logic, we may estimate the population variance $\sigma^2 = \sum_{i}{(x_i-\mu)^{2}{\Bbb P}(x_i)}$ using the statistic $$
s^2_1= \frac{1}{n} \sum_{i=1}^{n}{(x_i-\bar{x})^2}
$$ However, this statistic is not an unbiased estimator of $\sigma^2$. To see why, we can compute the expected value ${\Bbb E}(\sigma^2 - s_1^2)$, which would be zero if $s_1^2$ were unbiased (in the first line below, we use the fact that $\sigma^2={\Bbb E}[(x_i-\mu)^2]$, so replacing the constant $\sigma^2$ with the random quantity $\frac{1}{n}\sum_{i=1}^{n}{(x_i-\mu)^2}$ does not change the expected value): $$
\begin{align}
{\Bbb E} \left[ \sigma^2 - s_1^2 \right]

&= {\Bbb E} \left[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2} - \frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2}\right] \\[2mm]

&= \frac{1}{n}{\Bbb E}\left[
\sum_{i=1}^{n}{\left(
\left( x_i^2 - 2 x_i \mu + \mu^2) - (x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)
\right)} \right] \\[2mm]

&= {\Bbb E}\left[
\mu^2 - \bar{x}^2 + \frac{1}{n}\sum_{i=1}^{n}{(2x_i (\bar{x}-\mu))}
\right] \\[2mm]

&= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + 2\bar{x}(\bar{x}-\mu) \right] \\[2mm]

&= {\Bbb E}\left[ \mu^2 - 2\bar{x}\mu + \bar{x}^2 \right] \\[2mm]

&= {\Bbb E} \left[ (\bar{x} - \mu)^2 \right] \\[2mm]

&= \rm{Var}(\bar{x}) \\[2mm]

&= \frac{\sigma^2}{n}

\end{align}
$$ Since ${\Bbb E}[\sigma^2 - s_1^2] = {\Bbb E}[\sigma^2] - {\Bbb E}[s_1^2] = \sigma^2 - {\Bbb E}[s_1^2]$, the above implies that $$
{\Bbb E}[s_1^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2
$$ Therefore, the statistic $s^2 = \frac{n}{n-1}s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2}$, known as the sample variance, has expected value $\sigma^2$ and is thus an unbiased estimator of the population variance.

The replacement of $\frac{1}{n}$ with $\frac{1}{n-1}$ in the sample variance formula is known as Bessel's correction. The derivation above shows where the bias in $s_1^2$ comes from: each deviation $(x_i-\bar{x})$ differs from the quantity we actually care about, $(x_i-\mu)$, by the sampling error $(\bar{x}-\mu)$, and on average this causes $s_1^2$ to fall short of $\sigma^2$ by exactly the variance of $\bar{x}$, which we proved to be $\frac{\sigma^2}{n}$ in Proof of the Law of Large Numbers. Using $s^2$ instead of $s_1^2$ corrects for this bias.
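If you'd like to see Bessel's correction in action, here is a quick Monte Carlo sketch of my own (the fair-die population, sample size, and trial count are arbitrary choices): the uncorrected estimator settles near $\frac{n-1}{n}\sigma^2$, while the corrected one settles near $\sigma^2$.

```python
import random

population = [1, 2, 3, 4, 5, 6]             # a fair die: mu = 3.5, sigma^2 = 35/12
n, trials = 5, 200_000

sum_s1, sum_s = 0.0, 0.0
for _ in range(trials):
    x = [random.choice(population) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    sum_s1 += ss / n                         # biased version: divides by n
    sum_s += ss / (n - 1)                    # Bessel-corrected: divides by n - 1

print("E[s1^2] approx", round(sum_s1 / trials, 4), " (theory:", round((n - 1) / n * 35 / 12, 4), ")")
print("E[s^2 ] approx", round(sum_s / trials, 4), " (theory:", round(35 / 12, 4), ")")
```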


Estimation of the standard deviation


Given $s^2$ is an unbiased estimator of $\sigma^2$, we may expect that the sample standard deviation $s=\sqrt{s^2}$ would also be an unbiased estimator of the population standard deviation $\sigma$. However, $$
\begin{align}
{\Bbb E}\left[ s^2 \right] &= \sigma^2 \\[2mm]
\Rightarrow {\Bbb E}\left[ \sqrt{s^2} \right] &< \sqrt{{\Bbb E}\left[ s^2 \right]} = \sqrt{\sigma^2} = \sigma
\end{align}
$$ where the inequality follows from Jensen's inequality and the fact that the square root is a strictly concave function (its graph curves downward; equivalently, the region below its graph is convex). In other words, $s$ underestimates $\sigma$ on average.

Unfortunately, for estimating the population standard deviation, there is no easy correction as there is for the variance. The size of the necessary correction depends on the distribution of the underlying random variable. For the Normal distribution, there is a complicated exact formula, but simply replacing the $n-1$ in the denominator with $n-1.5$ eliminates most of the bias (with the remaining bias decreasing with increasing sample size). A further adjustment is possible for other distributions and depends on the excess kurtosis, a measure of the "heavy-tailedness" of the distribution in excess of that of the Normal distribution.
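To illustrate (a sketch of my own, using normally distributed data and arbitrary choices of $n$ and the trial count), the snippet below shows both the downward bias of the usual $s$ and the effect of the $n-1.5$ denominator.

```python
import random

sigma, n, trials = 1.0, 5, 200_000
sum_s, sum_s_corr = 0.0, 0.0
for _ in range(trials):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    sum_s += (ss / (n - 1)) ** 0.5          # usual sample standard deviation
    sum_s_corr += (ss / (n - 1.5)) ** 0.5   # rough bias correction for normal data

print("E[s] with n-1   denominator approx", round(sum_s / trials, 4), "(true sigma = 1)")
print("E[s] with n-1.5 denominator approx", round(sum_s_corr / trials, 4))
```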

While the specific corrections are beyond the scope of this post, for the brave, there is an entire Wikipedia article dedicated to exactly this topic.


Other measures of estimator quality


Zero bias is certainly a desirable quality for a statistic to have, but an estimator's quality depends on more than just its expected value. A statistic's variance tells us how large of a spread (from its expected value) we may expect when calculating the statistic based on various samples. Just as a statistic with large bias is not particularly helpful, neither is one with no bias but a large variance.


The notion of consistency ties bias and variance together nicely: a consistent estimator is one which converges in probability to the population parameter. This means that, as $n \rightarrow \infty$, the probability of an error greater than any specified amount $\epsilon$ approaches zero. In particular, any estimator whose bias and variance both tend to zero as the sample size grows is consistent (this follows from Chebyshev's inequality).

For example, the (weak) law of large numbers implies that $\bar{x}$ is a consistent estimator of $\mu$, as $\lim_{n \rightarrow \infty}{{\Bbb P} \left[ \left| \bar{x} - \mu \right| \geq \epsilon \right]} = 0$ for any $\epsilon > 0$. Furthermore, $s_1^2$ and $s^2$ are both consistent estimators of $\sigma^2$, while $s$ is a consistent estimator of $\sigma$. These examples show that both biased and unbiased estimators can be consistent.
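For readers who want to see consistency numerically, here is a small sketch of my own (a fair die, with arbitrary choices of $\epsilon$ and the trial count) showing the probability of a sizable error in $\bar{x}$ shrinking as $n$ grows.

```python
import random

mu, eps, trials = 3.5, 0.25, 20_000
for n in (10, 100, 1000):
    # count the trials in which the sample mean misses mu by at least eps
    bad = sum(
        abs(sum(random.randint(1, 6) for _ in range(n)) / n - mu) >= eps
        for _ in range(trials)
    )
    print(f"n={n:5d}  P[|xbar - mu| >= {eps}] approx {bad / trials:.4f}")
```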

That will do it for Part 1 of this post. Thanks for reading, and look out for Parts 2 and 3, coming up soon. Thanks to Anonymous for the great reader request.


Sources:


Wikipedia: Unbiased estimation of standard deviation
Wikipedia: Bessel's correction
Quora post: estimator bias vs. variance