
Parameter Estimation - Part 3 (Minimum-Variance Unbiased Estimator)

Preliminaries: How do mathematicians model randomness?
Parameter Estimation - Part 1 (Reader Request)
Parameter Estimation - Part 2 (German Tank Problem)

First off, a housekeeping note: apologies to all for the long delay between the last post and this one (over a year...oops!). Each post does take me a while behind the scenes, and work plus other non-math activities (mainly history stuff) partly deprived me of the time I needed to put this together. That being said, I will try to keep that gap shorter going forward. Now, on to the actual content.

In Part 2, we examined various unbiased estimators of the size of a population based on a random sample of serial numbers, concluding that the estimator $N_3 = m + (\frac{m}{k} - 1)$ (where $x_k = m$ is the maximum serial number observed in the sample, and $k$ is the sample size) had lower variance than the others. I also made the claim that this estimator has the lowest variance of any unbiased estimator of the population size. In this Part 3, I will introduce the concepts of completeness and sufficiency of a statistic, as well as the theorems that allow us to use these properties to find the minimum-variance unbiased estimator (and prove it is unique).


Statistics/parameters recap; some new notation


We saw in Parts 1 and 2 that we can use an observable statistic or estimator, associated with a sample, to estimate an unobservable parameter, associated with the entire population. For example, we used formulas based on the sample maximum, $x_k = m$, to estimate the population size/maximum, $N$ in the German Tank Problem. The values of the statistic have a probability distribution, called the sampling distribution of the statistic, as different samples of size $k$ will yield different values of the estimator.

If the mean (a.k.a. expected value) of the sampling distribution of a statistic equals the value of the parameter, then the statistic is unbiased; otherwise, it's biased. Furthermore, there is also a variance associated with the estimation. Now, there can be a trade-off between bias and variance (i.e. we could accept an estimator we know to be slightly biased for a significant reduction of variance vs. an unbiased alternative), a concept I did not delve into in the past posts. Among unbiased estimators though, we would prefer one with as little variance as possible. As the title suggests, we'll continue to focus only on unbiased estimators in this post.

In Part 2, we saw that the sampling distribution of the sample maximum was $$
\Bbb{P}[x_k = m] = f(m) = \frac{m-1 \choose k-1}{N \choose k}
$$ which obviously depends on the value of the parameter, $N$. The probability distribution would be a different one for a different value of the parameter. Thus, we have here a statistical model for randomly choosing a sample of $k$ serial numbers from a population of size $N$ (and from a discrete uniform distribution over $\lbrace 1, 2, \dots , N \rbrace$).
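
To make this concrete, here is a minimal simulation sketch (in Python, which I'll use for all the code snippets in this post) that draws many samples of size $k$ without replacement from $\lbrace 1, 2, \dots , N \rbrace$ and compares the empirical frequencies of the sample maximum against the formula above. The particular values $N=20$ and $k=4$ are just illustrative choices.

    import random
    from math import comb

    N, k = 20, 4              # illustrative population size and sample size
    trials = 200_000

    # Empirical distribution of the sample maximum
    counts = {}
    for _ in range(trials):
        sample = random.sample(range(1, N + 1), k)    # sampling without replacement
        m = max(sample)
        counts[m] = counts.get(m, 0) + 1

    # Compare to the formula P[max = m] = C(m-1, k-1) / C(N, k)
    for m in range(k, N + 1):
        empirical = counts.get(m, 0) / trials
        exact = comb(m - 1, k - 1) / comb(N, k)
        print(f"m = {m:2d}   empirical = {empirical:.4f}   formula = {exact:.4f}")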

The definitions in this post will require us to be a bit more rigorous than before, thinking of our statistical model as a family of probability distributions $\Bbb{P}_\theta$, parameterized, or indexed, by the value of a parameter $\theta$ (a common letter to use for a parameter in this context). In this post, I will use the notations $\Bbb{P}_\theta$ and $\Bbb{E}_\theta$ to denote the probability and expected value, respectively, for a particular value, $\theta$, of the relevant parameter within the model in question. I'll call the set of possible parameter values $\Theta$, and thus I'll refer to the model itself as $\scr{P} = \lbrace {\Bbb P}_\theta \, | \, \theta \in \Theta \rbrace$.

We'll also need the concept of conditional probability. The probability that $A$ will occur, given that $B$ has occurred, is denoted ${\Bbb P}[A|B]$ (read "probability of $A$, given $B$"). In general, the probability that $A$ and $B$ occur is ${\Bbb P}[A \cap B] = {\Bbb P}[A|B]{\Bbb P}[B] \, (\clubsuit)$. We can see that this equality is true by rearranging as ${\Bbb P}[A|B]={\Bbb P}[A \cap B] / {\Bbb P}[B]$: the probability of $A$, given $B$, is the probability of $A$'s intersection with $B$, divided by the probability of the entire event $B$, i.e. the probability of the cases of interest divided by the total probability of the cases that remain possible once we know $B$ has occurred.

Finally, we'll need the law of total expectation: ${\Bbb E}[{\Bbb E}[A | B]]={\Bbb E}[A]$. Note that ${\Bbb E}[A|B]$ is a function of $B$, $f(b)={\Bbb E}[A \, | \, B=b]$; the outer expectation then averages over the possible values $b$ of $B$.

The law of total expectation is illustrated by a simple example from the Wikipedia article (I will include the link in Sources below): if factory $X$ makes 60% of the light bulbs at your hardware store, and $X$'s light bulbs last 5,000 hours on average, while factory $Y$ makes the other 40%, with $Y$'s bulbs lasting 4,000 hours on average, then the expected lifetime $L$ of a random bulb purchased would be: $$
{\Bbb E}[L] =
\underbrace{5{,}000 \times .6}_{{\Bbb E}[L|X] \times {\Bbb P}[X]}
+
\underbrace{4{,}000 \times .4}_{{\Bbb E}[L|Y] \times {\Bbb P}[Y]}
= 4{,}600
$$ The formal proof of the law of total expectation just involves manipulations of a double-sum, so I won't include it here.
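
As a quick numerical sanity check of the light bulb computation, here is a small sketch that simulates buying a random bulb; the exponential lifetime distributions are purely an illustrative assumption on my part, chosen only so that the conditional means come out to 5,000 and 4,000 hours.

    import random

    random.seed(1)
    trials = 200_000
    total = 0.0
    for _ in range(trials):
        if random.random() < 0.6:                  # factory X: E[L|X] = 5,000 hours
            total += random.expovariate(1 / 5000)
        else:                                      # factory Y: E[L|Y] = 4,000 hours
            total += random.expovariate(1 / 4000)

    print(total / trials)    # should land close to E[L] = 0.6*5000 + 0.4*4000 = 4600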


Complete statistics


Let $X$ be a random variable within a parametric model $\scr{P}$, and let ${\mathbf x} = (x_1, x_2, \dotsc, x_k)$ be a sample of values of $X$. Let $T$ be a statistic, i.e. a function of $\mathbf x$ (thus, below, $T$ will be understood to mean $T({\mathbf x})$).

The statistic $T$ is called complete if, for any (measurable*) function $g$, $$
{\Bbb E}_\theta \left[ g(T) \right] = 0, \forall \theta \in \Theta
\implies
{\Bbb P}_\theta \left[ g(T)=0 \right] = 1, \forall \theta \in \Theta
$$ This is a bulky definition, so let's unpack it. The $\implies$ arrow means that if the statement on the left of the arrow is true, then the statement on the right of the arrow is also true.

The statement on the left is saying that, for every value $\theta$ of the parameter, the expected value of $g(T)$ (under the specific version of the model with $\theta$ as the parameter) is zero, for a (measurable) function $g$.

The statement on the right is saying that, for every value $\theta$ of the parameter, the probability (under the specific version of the model with $\theta$ as the parameter) that $g(T)=0$ is 1, which means that $g$ is the zero function almost everywhere. This means that the set $\lbrace {\bf x} \, | \, g(T({\bf x})) \neq 0 \rbrace$ has probability zero (though it is not necessarily the empty set, an important distinction). When this is the case, $g$ is, for probabilistic purposes, essentially the same as the zero function.

*Note: "measurable" has a specific meaning beyond the scope of this post, but this basically just serves to rule out ridiculous functions. For all intents and purposes, any function you can think of is measurable.

$g(T)$ in this definition should be thought of as an estimator (for some parameter) based on the statistic $T$. In the German Tank Problem example, $T$ could be the max function $T({\mathbf x})=\max (x_i) =m$, and $g(T)$ would be some function of that. If ${\Bbb E}_\theta \left[ g(T) \right] = 0$, we call $g(T)$ an unbiased estimator of zero.

So this definition of completeness is really saying that $T$ is a complete statistic if the only way to obtain an unbiased estimator of zero from $T$ is to apply an (almost-everywhere) zero function to $T$. Equivalently, applying a function that is not a.e. zero to a complete statistic $T$ can never produce an unbiased estimator of zero.

Note that whether a statistic is complete or not depends on the model: the "$\forall \theta \in \Theta$" part of the definition means that completeness may depend on which values of $\theta$ are, or are not, permitted in the model. For this reason, it is technically more correct to say that the family of probability distributions ${\scr F}_T = \lbrace f_T(t|\theta) \, | \, \theta \in \Theta \rbrace$ is complete, where the $f_T(t| \theta)$ are of course the probability distributions of $T$ for different values of the parameter. However, it is also common to simply say that "the statistic $T$ is complete," and I will use the simpler terminology for the remainder of this post. It is not too difficult to come up with examples in which placing a restriction on $\Theta$ changes the completeness status of a statistic; I'll provide links in Sources that include a few such examples, but in the interest of post length, I won't go into one here.

This rather technical definition of completeness serves an important purpose. As the following Proposition shows, the property of completeness, as defined, guarantees the uniqueness of unbiased estimators.

Proposition 1: If a statistic $T$ is complete, then any unbiased estimator of a parameter $\theta$, based on $T$ (i.e. any estimator $g(T)$ that is a function of $T$), is essentially unique.

More precisely, the term essentially unique means that, if there is another function $h(T)$ that is an unbiased estimator of $\theta$, then $h(T)=g(T)$ almost everywhere. In other words, the set of values of $T$, for which $h \neq g$, has probability zero: ${\Bbb P}\left[ \lbrace t=T({\bf x}) \, | \, h(t) \neq g(t) \rbrace \right]=0$.

Proof: Suppose $g_1(T)$ and $g_2(T)$ are two estimators based on $T$ that both have expected value $\theta$ for every $\theta \in \Theta$, i.e. both are unbiased estimators of the parameter $\theta$. Then the function $g(T)=g_1(T)-g_2(T)$ is an unbiased estimator of zero: $$
{\Bbb E}_\theta[g(T)] = {\Bbb E}_\theta[g_1(T)] - {\Bbb E}_\theta[g_2(T)] = \theta - \theta = 0, \quad \forall \theta \in \Theta
$$ Since $T$ is complete, this forces $g(T)=g_1(T)-g_2(T)$ to equal zero almost everywhere, i.e. $g_1(T)=g_2(T)$ almost everywhere. Thus, the unbiased estimator of $\theta$ based on $T$ is essentially unique. $\square$


The sample max in the German Tank Problem is a complete statistic


Because $\Bbb{P}_\theta [x_k = m] = {m-1 \choose k-1}/{\theta \choose k}$ (here, we've replaced the population maximum $N$ with the letter $\theta$ in keeping with the notation from earlier in this post), we also know that, for a function $g$, and using $T$ as the sample maximum statistic: $$
{\Bbb E}_\theta[g(T)] = \sum_{j=k}^{\theta}{g(j) {\Bbb P}_\theta[T=j]} = \sum_{j=k}^{\theta}{g(j) \frac{j-1 \choose k-1}{\theta \choose k}}
$$ (the sum starts at $j=k$ because the maximum of $k$ distinct serial numbers is at least $k$). Suppose that the above expected value is $0$ for all permitted values of the parameter $\theta$, which, in this model, are $\lbrace k, k+1, k+2, \dotsc \rbrace$ (we need at least $k$ tanks in order to observe a sample of $k$ distinct serial numbers). Then, in particular, it is $0$ for $\theta = k$. In that simplest case, the sum has only one term, $j=k$, and since the fraction becomes ${k-1 \choose k-1} / {k \choose k}=1/1=1$, we must have $g(k)=0$ in order for the whole sum to equal zero.

The expectation must also be $0$ for the next case, $\theta=k+1$. Plugging in $g(k)=0$ from above, we are left with only one term, $j=k+1$, in the two-term sum. Since the fraction of binomial coefficients, ${k \choose k-1} / {k+1 \choose k}$, is again non-zero, we must have $g(k+1)=0$ by the same reasoning as above.

Continuing with this inductive argument, we obtain $g(j)=0$ for every value $j$ that $T$ can take, i.e. $g$ must be the zero function (on the support of $T$) in order to have ${\Bbb E}_\theta[g(T)]=0, \forall \theta \in \Theta$. This proves that the sample maximum $T$ is a complete statistic with respect to the model ${\scr P}$.
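
The inductive argument above is really just forward substitution on a triangular system of linear equations. The sketch below (with the illustrative values $k=3$ and $\theta = k, \dotsc, 12$) solves for $g(k), g(k+1), \dotsc$ from the conditions ${\Bbb E}_\theta[g(T)]=0$ and confirms that every $g(j)$ comes out to zero.

    from math import comb

    k = 3                        # illustrative sample size
    thetas = range(k, 13)        # parameter values theta = k, ..., 12

    g = {}                       # solve for g(k), g(k+1), ... by forward substitution
    for theta in thetas:
        # Condition: E_theta[g(T)] = sum_{j=k}^{theta} g(j) * C(j-1, k-1) / C(theta, k) = 0
        known = sum(g[j] * comb(j - 1, k - 1) / comb(theta, k) for j in range(k, theta))
        coeff = comb(theta - 1, k - 1) / comb(theta, k)    # coefficient of g(theta); always > 0
        g[theta] = -known / coeff
        print(f"g({theta}) = {g[theta]}")                  # prints 0.0 for every theta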


Sufficient statistics and the Factorization Theorem


Proposition 1 suggests that, if we can somehow use a complete statistic $T$ to derive a minimum-variance unbiased estimator, then that estimator will be unique. Sufficiency is the property we need in order to actually find an unbiased estimator with minimum variance.

An estimator has variance because different samples yield different values of the estimator. Furthermore, some information in the samples is relevant to the estimation of the parameter, while other information is not. For example, from Part 2, while common sense dictates that the sample maximum in the German Tank Problem should certainly be relevant in estimating the population size, the estimator $N_1=2{\bar x}-1$ is based on the sample mean. Similarly, estimator $N_2=m+(x_1-1)$, though doing better since it's based on $m$, includes a "gap" adjustment based on the lowest serial number in the sample. These facts about the sample are "noise": they don't help us estimate $N$ more accurately.

Indeed, for both $N_1$ and $N_2$, different samples, all of size $k$ and with the same sample maximum $m$, can yield different estimator values. $N_3 = m + (\frac{m}{k}-1)$, on the other hand, depends only on the sample max $m$ and sample size $k$ (recall from Part 2 that the "average gap" adjustment simplified down to depend only on these). While different samples will have different values of $m$, yielding some estimator variance that is unavoidable, if we are given a certain value of $m$, we can have only one value of the estimator $N_3$. Thus, $N_3$ has less "avoidable noise" than the other estimators and should be the least volatile.
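
To illustrate, here is a quick sketch comparing the three estimators from Part 2 on a few hand-picked samples of size $k=4$, all sharing the same sample maximum $m=60$: $N_3$ returns the same answer every time, while $N_1$ and $N_2$ bounce around with the "noise" in the rest of the sample. (The samples themselves are arbitrary illustrative choices.)

    k = 4
    samples = [                        # illustrative samples, all with max m = 60
        (3, 17, 42, 60),
        (20, 35, 51, 60),
        (1, 2, 3, 60),
    ]

    for s in samples:
        m, lo, xbar = max(s), min(s), sum(s) / k
        n1 = 2 * xbar - 1              # based on the sample mean
        n2 = m + (lo - 1)              # max plus the "gap" below the minimum
        n3 = m + m / k - 1             # based only on the sample max
        print(f"sample {s}:  N1 = {n1:.1f},  N2 = {n2},  N3 = {n3}")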

The discussion above is the intuition behind the following definition: a statistic $T({\bf x})$ is called sufficient (for the parameter $\theta$) if the conditional probability ${\Bbb P}_\theta[X={\bf x} \, | \, T({\bf x})=t]$ does not depend on $\theta$ (it may, of course, depend on the sample ${\bf x}$ and on the value $t$ of the statistic, just not on the parameter).

If $T$ is a sufficient statistic, then the probability that the sample takes on a particular value ${\bf x}$ depends on $\theta$ only through the value $t$ of the statistic. Flipping this around, when we estimate the value of the unknown parameter $\theta$ based on the sample ${\bf x}$, then the value $t=T({\bf x})$ of the statistic already contains all the information from the sample that is relevant to estimating $\theta$. For this reason, a sufficient statistic $T$ is sometimes referred to as a lossless compression of the sample ${\bf x}$.

The following theorem characterizes sufficient statistics.

Fisher-Neyman Factorization Theorem: Let $f_X({\bf x}|\theta)$ be the probability mass (or density) function of the sample $X$, given the parameter value $\theta$. A statistic $T(X)$ is sufficient for the parameter $\theta$ if and only if there exist non-negative functions $g$ and $h$ such that $$
f_X({\bf x}|\theta) = h({\bf x})g(\theta, T({\bf x}))
$$ In other words, the pmf (or pdf) of $X$ can be factored into a product such that one factor, $h$, does not depend on $\theta$, and the other factor, which does depend on $\theta$, depends on the sample ${\bf x}$ only through the value $t=T({\bf x})$ of the statistic.

This theorem makes some intuitive sense given the discussion above; a thorough reading of the proof helped me understand the symbols in terms of this intuition (and vice-versa). Note that this proof only works for the case of discrete random variables, but the theorem also applies in the case of continuous r.v.'s.

Proof:
Sufficiency $\implies$ factorization:
First, note the trivial fact that, since $T$ is a function of $X$, ($T=T({\bf x})$ and $X={\bf x}$) if and only if $X={\bf x}$. This means that ${\Bbb P}_\theta [X={\bf x}]={\Bbb P}_\theta [X={\bf x} \cap T=T({\bf x})]$, or in terms of the probability mass functions, $f_X({\bf x}|\theta)=f_{X,T(X)}({\bf x},T({\bf x})|\theta)$. Therefore: $$
\begin{align}

f_X({\bf x}|\theta)

&= f_{X,T(X)}({\bf x},T({\bf x})|\theta) \\[2mm]

&\stackrel{(\clubsuit)}{=}
\underbrace{f_{X|T(X)}({\bf x}|T({\bf x}),\theta) }_{h({\bf x})}
\cdot
\underbrace{f_{T({\bf x})}(T({\bf x})|\theta)}_{g(\theta,T({\bf x}))}

\end{align}
$$ The second equality is just equation $(\clubsuit)$ from the intro, written in terms of the pmf's rather than the probabilities. Since $T$ is sufficient, the conditional pmf $f_{X|T(X)}({\bf x}|T({\bf x}),\theta)$ does not depend on $\theta$; evaluated at $t=T({\bf x})$, it is a function of ${\bf x}$ alone, so we can write it as $h({\bf x})$. The second pmf, $f_{T({\bf x})}(T({\bf x})|\theta)$, depends only on $t=T({\bf x})$ and $\theta$, so we can write it as $g(\theta,T({\bf x}))$, yielding the desired factorization formula.

Factorization $\implies$ sufficiency:
We turn again to the pmf version of $(\clubsuit)$: $$
f_{X|T(X)}({\bf x}|t,\theta)=\frac{f_{X,T(X)}({\bf x},t|\theta)}{f_{T(X)}(t|\theta)} \, \, \, \, (1)
$$ The numerator on the right-hand side is equal to $f_X({\bf x}|\theta)$ if $t=T({\bf x})$, and zero if $t \neq T({\bf x})$, again since $T$ is a function of $X$. Assuming the factorization formula, $f_X({\bf x}|\theta) \stackrel{(2)}{=} h({\bf x})g(\theta,t)$ when $t=T({\bf x})$.

For the denominator, since $T$ is a function of $X$, we have ${\Bbb P}_\theta[T=t]=\sum{\Bbb P}_\theta[X={\bf \tilde x}]$, where the sum is taken over the values of the dummy variable ${\bf \tilde x}$ for which $T({\bf \tilde x})=t$. Thus: $$
f_{T(X)}(t|\theta)
= \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{f_X({\bf \tilde x}|\theta)}
\stackrel{(3)}{=} \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}
$$ where the first equality is the pmf version of the probability statement above, and the second equality follows from the assumption of the factorization formula for $f_X({\bf x}|\theta)$.

Plugging $(2)$ and $(3)$ into $(1)$ yields: $$
\begin{align}
f_{X|T(X)}({\bf x}|t,\theta)

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}} \\[2mm]

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]g(\theta,t)}}

=\frac
{h({\bf x})}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]}}
\end{align}
$$ where we took the $g(\theta,t)$ out of the sum in the second equality since it doesn't depend on ${\bf \tilde x}$. Since the final expression does not depend on $\theta$, we have proved that $T$ is sufficient.
$\square$


The sample max in the German Tank Problem is a sufficient statistic


To show that the sample max is a sufficient statistic, we can use the Factorization Theorem above. Let ${\bf 1}_A$ be the function that equals $1$ if $A$ is true and $0$ otherwise (this is called the indicator function of event $A$), and, for a sample ${\bf x}=(x_1,x_2,\dotsc,x_k)$, let the statistic $T({\bf x})=\max(x_i)$, the sample max.

If we treat the sample as $k$ independent draws from the discrete uniform distribution on $\lbrace 1, 2, \dotsc, \theta \rbrace$, the probability mass function for the sample $X$ taking the value ${\bf x}=(x_1, x_2, \dotsc, x_k)$, given that the population size is $\theta$, is: $$
\begin{align}
f_X({\bf x} \, | \, \theta)

&= \left( \frac{1}{\theta}{\bf 1}_{\lbrace 1 \leq x_1 \leq \theta \rbrace} \right)
\left( \frac{1}{\theta}{\bf 1}_{\lbrace 1 \leq x_2 \leq \theta \rbrace} \right)
\dotsb
\left( \frac{1}{\theta}{\bf 1}_{\lbrace 1 \leq x_k \leq \theta \rbrace} \right) \\[2mm]

&= \frac{1}{\theta^k} {\bf 1}_{\lbrace 1 \leq \min(x_i) \rbrace} {\bf 1}_{\lbrace \max(x_i) \leq \theta \rbrace} \\[2mm]

&= \underbrace{
\vphantom{\frac{1}{\theta^k}}
{\bf 1}_{\lbrace 1 \leq \min(x_i) \rbrace}
}_{h({\bf x})}
\,
\underbrace{\frac{1}{\theta^k} {\bf 1}_{\lbrace T({\bf x}) \leq \theta \rbrace}}_{g(\theta,T({\bf x}))}

\end{align}
$$ Thus, by the Factorization Theorem, the sample max $T(X)$ is a sufficient statistic for the population size $\theta$. (The calculation above treats the $k$ draws as independent, i.e. sampling with replacement; in the without-replacement version from Part 2, the joint pmf is $\frac{1}{\theta(\theta-1)\dotsm(\theta-k+1)}$ times an indicator that the $x_i$ are distinct and at least $1$, and the same factorization, with $g(\theta, T({\bf x})) = \frac{1}{\theta(\theta-1)\dotsm(\theta-k+1)} {\bf 1}_{\lbrace T({\bf x}) \leq \theta \rbrace}$, goes through.)
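
We can also see sufficiency directly, without the Factorization Theorem. In the sketch below (illustrative values $k=3$, conditioning on the max being $6$, and trying two different population sizes), the conditional probability of a particular sample, given its maximum, comes out the same whether the population size is 8 or 12: it depends on $m$, but not on $\theta$.

    from itertools import combinations
    from math import comb

    k, m = 3, 6                       # illustrative sample size; condition on the max being 6
    for theta in (8, 12):             # two different population sizes
        # Under sampling without replacement, each k-subset of {1,...,theta} has
        # probability 1 / C(theta, k).
        p_subset = 1 / comb(theta, k)
        # Marginal probability that the sample max equals m, by brute-force enumeration:
        p_max = sum(p_subset for s in combinations(range(1, theta + 1), k) if max(s) == m)
        # Conditional probability of one particular sample, e.g. {2, 4, 6}, given max = m:
        print(f"theta = {theta}:  P[X = {{2,4,6}} | T = {m}] = {p_subset / p_max:.4f}")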


The Rao-Blackwell and Lehmann-Scheffé Theorems


The discussion above suggests that completeness and sufficiency should help us obtain a unique estimator with minimal variance. The Rao-Blackwell and Lehmann-Scheffé Theorems prove this definitively. The first of these two gives us a process to reduce the variance of an estimator:

[Image collage, clockwise from top-left: Rao, Blackwell, Scheffé, Lehmann. All four images courtesy of Wikipedia.]

Rao-Blackwell Theorem:
Let $U=u(X)$ be an unbiased estimator of $\theta$, and let $T=t(X)$ be a sufficient statistic for $\theta$. Then the Rao-Blackwell estimator $$
U^*=u^*({\bf x}) = {\Bbb E}_{\theta}[U \, | \, T=t({\bf x})]
is also unbiased, and its variance is no greater than that of $U$.

In the proof below, all ${\Bbb E}$'s are understood to stand for ${\Bbb E}_{\theta}$.

Proof: By the definition of sufficiency given above, $u^*({\bf x})$ is indeed a function of ${\bf x}$, since the conditional expectation of $U$, given the sufficient statistic $T$, does not depend on $\theta$. Also, $U^*$ is unbiased by the law of total expectation and the fact that $U$ is unbiased: $$
{\Bbb E}[U^*] = {\Bbb E}[{\Bbb E}[U|T]] = {\Bbb E}[U] = \theta
$$ Furthermore, the variance of $U^*$ is less than that of $U$: $$
\begin{align}
{\rm var}[U^*]

&= {\Bbb E}\left[ \left( U^* - {\Bbb E}[U^*] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right]  \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U|T] - \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U-  \theta \, | \, T] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{3}\\[2mm]


&\leq {\Bbb E}\left[ {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \, \middle| \, T \right] \right] \tag{4} \\[2mm]

&= {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{5} \\[2mm]

&= {\Bbb E}\left[ \left( U-  {\Bbb E}[U] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\rm var}[U]

\end{align}
$$ The third equality is true because $\theta$ is a constant (and ${\Bbb E}$ is linear), the inequality in line 4 is due to Jensen's inequality, and the equality in line 5 is the law of total expectation. $\square$

Rao-Blackwell is telling us that if we supply an unbiased estimator, we can condition on a sufficient statistic to obtain a new unbiased estimator with lower variance. This sounds like a good process to source MVUE candidates. The Lehmann-Scheffé Theorem is the last piece of the puzzle, and we need one more Proposition to prove it.
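
As a concrete illustration of Rao-Blackwellization in the German Tank Problem (with the illustrative values $\theta=12$ and $k=3$), the sketch below starts from the mean-based estimator $N_1 = 2{\bar x}-1$ and computes ${\Bbb E}[N_1 \, | \, T=m]$ by brute-force enumeration of all samples with a given max. The conditional expectation comes out to exactly $m + \frac{m}{k} - 1$, i.e. the estimator $N_3$, for every value of $m$; the Lehmann-Scheffé Theorem below is precisely the statement that this is no coincidence.

    from itertools import combinations

    theta, k = 12, 3      # illustrative population size and sample size

    for m in range(k, theta + 1):
        # All samples (k-subsets of {1,...,theta}) whose maximum equals m; given T = m,
        # each of these is equally likely, so E[N1 | T = m] is a plain average.
        samples = [s for s in combinations(range(1, theta + 1), k) if max(s) == m]
        n1_values = [2 * sum(s) / k - 1 for s in samples]          # N1 = 2*mean - 1
        rao_blackwellized = sum(n1_values) / len(n1_values)        # E[N1 | T = m]
        n3 = m + m / k - 1
        print(f"m = {m:2d}:  E[N1 | T = m] = {rao_blackwellized:.4f},  N3 = {n3:.4f}")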

Proposition 2: If an MVUE exists, it is essentially unique. That is, if $U$ and $U'$ are two MVUE's, then $U=U'$ almost everywhere.

Proof: For random variables $X$ and $Y$, and constants $c, c_1, c_2$, we have the following variance and covariance formulas (which I won't derive in detail here): $$
\begin{align}
{\rm var}[X+Y] &= {\rm var}[X]+{\rm var}[Y] + 2{\rm cov}(X,Y) \tag{1}\\[2mm]
{\rm var}[cY] &= c^2 {\rm var}[Y] \tag{2} \\[2mm]
{\rm cov}(c_1 X,c_2 Y) &= c_1 c_2 {\rm cov}(X,Y) \tag{3}\\
\end{align}
$$ So, let $U$ be an MVUE and $U'$ another MVUE. Since $U$ and $U'$ both have expected value $\theta$, their difference has expected value zero. Thus, if we can show that ${\rm var}[U'-U]=0$, that would mean that $U'-U$ is a constant almost everywhere; since that constant has expected value zero, it must be zero, and thus $U'=U$ almost everywhere. Using the formulas above, for any constant $c$, we have: $$
{\rm var}[U+c(U'-U)] = {\rm var}[U] + c^2 {\rm var}[U'-U] + 2c \, {\rm cov}(U, U'-U) \, \,(\spadesuit)
$$ Assume ${\rm var}[U'-U] = \gamma >0$ (we will show this leads to a contradiction), and let $\rho = -{\rm cov}(U,U'-U)$. Equation $(\spadesuit)$ must be true for any value of $c$, so let's plug in $c=\rho / \gamma$: $$
\begin{align}
{\rm var}[U+c(U'-U)]

&= {\rm var}[U] + \left( \frac{\rho}{\gamma} \right)^2 \gamma -2\left( \frac{\rho}{\gamma} \right) \rho \\[2mm]

&= {\rm var}[U] - \rho^2 / \gamma
\end{align}
$$ Since $U$ is an MVUE, the variance of $U+c(U'-U)$ can't be less than that of $U$. Thus, since $\rho^2 / \gamma \geq 0$, we must have $\rho^2 / \gamma = 0 \implies \rho = -{\rm cov}(U,U'-U)=0$.

Plugging this into $(\spadesuit)$, and now using the value $c=1$, we obtain: $$
\begin{align}
{\rm var}[U+1 \cdot (U'-U)] &= {\rm var}[U] + 1^2 \cdot {\rm var}[U'-U] + 0 \\[2mm]
{\rm var}[U'] &= {\rm var}[U] + \gamma
\end{align}
But $U'$ is an MVUE, so its variance cannot be greater than that of $U$, which forces $\gamma \leq 0$. This contradicts our assumption that $\gamma > 0$, so we must instead have ${\rm var}[U'-U]=\gamma=0$, and thus $U'=U$ almost everywhere. $\square$

The proof of the following theorem will now be nice and easy, based on the discussion above.

Lehmann-Scheffé Theorem: If $T$ is a complete and sufficient statistic for a parameter $\theta$, and $U$ is an unbiased estimator of $\theta$, then $U^*={\Bbb E}[U|T]$ is the unique MVUE for $\theta$.

Proof: Firstly, let's break down the notation. Recall from the intro section that $U^*={\Bbb E}[U|T]$ is indeed a function of the value of $T$: if ${\bf x}$ is the value of our sample, then the statistic $T$ takes the value $t=T({\bf x})$, and $U^*({\bf x}) = g(t) = {\Bbb E}[U \, | \, T({\bf x})=t]$. Recall that we can write this in the simpler form $U^*=g(T)$, with the dependence on the sample ${\bf x}$ implied.

By Rao-Blackwell, we know that (1) $U^*$ is unbiased (since $U$ is unbiased, and $T$ is sufficient), and (2) ${\rm var}[U^*] \leq {\rm var}[U]$. Thus, $U^*$ is an MVUE candidate. Now, let $V$ be any unbiased estimator of $\theta$. We will prove that ${\rm var}[U^*] \leq {\rm var}[V]$, which makes $U^*$ an MVUE.

Define $V^* = {\Bbb E}[V|T] := h(T)$. Also by Rao-Blackwell, $V^*$ is unbiased and has variance less than or equal to that of $V$.

Furthermore, both $g(T)=U^*$ and $h(T)=V^*$ are unbiased estimators of $\theta$ that are functions of the complete statistic $T$, so by Proposition 1 the following is true (almost everywhere): $$
V^* = h(T) = g(T) = U^*
$$ Thus, ${\rm var}[U^*] = {\rm var}[V^*] \leq {\rm var}[V]$. Since $V$ was an arbitrary unbiased estimator, $U^*$ is an MVUE, and by Proposition 2, it is the (essentially) unique MVUE of $\theta$. $\square$

All of the above theorems and propositions allow us to conclude that all of the following are true:

  • If we can find an unbiased estimator $U$ of $\theta$ as well as a complete and sufficient statistic $T$, then the Rao-Blackwell estimator $U^* = {\Bbb E}[U|T]$ is the MVUE of $\theta$ (Lehmann-Scheffé). The MVUE is a function of $T$.
  • This function of $T$ is essentially unique (Proposition 1).
  • An MVUE is essentially unique (Proposition 2). This means that, regardless of which complete and sufficient statistic $T$ we use to Rao-Blackwellize, we will obtain the same MVUE.
  • Similarly, we will obtain the same MVUE regardless of which unbiased estimator $U$ we start with.
The various sources listed below all come to these conclusions, often in different orders or stating some a bit differently. In Sources below, I include some commentary on the approaches of the respective authors.

Finally, turning back to the unbiased estimator $N_3$ from the German Tank Problem, it is now easy to see why it is the MVUE. We proved above that $T({\bf x})=\max(x_i)=m$ is complete and sufficient. As $N_3$ only depends on $m$, conditioning on $T$ obviously adds no additional information, so $N_3^* = {\Bbb E}[N_3 \, | \, T = m] = N_3$. By the Lehmann-Scheffé Theorem, $N_3$ is thus the unique MVUE for the population size $\theta$. An even simpler explanation is that $N_3$ is an unbiased function of the complete, sufficient statistic $T$. Since the MVUE is the unique unbiased function of $T$, $N_3$ is the MVUE.
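
As a final numerical epilogue, here is a quick simulation sketch (with the illustrative values $\theta=500$ and $k=5$) estimating the means and variances of $N_1$, $N_2$, and $N_3$: all three come out approximately unbiased, and $N_3$ shows the smallest variance, as the theory above says it must.

    import random
    from statistics import mean, pvariance

    random.seed(2)
    theta, k, trials = 500, 5, 100_000               # illustrative values
    est = {"N1": [], "N2": [], "N3": []}

    for _ in range(trials):
        s = random.sample(range(1, theta + 1), k)    # sample without replacement
        m, lo, xbar = max(s), min(s), sum(s) / k
        est["N1"].append(2 * xbar - 1)               # based on the sample mean
        est["N2"].append(m + (lo - 1))               # max plus the gap below the min
        est["N3"].append(m + m / k - 1)              # based only on the sample max

    for name, values in est.items():
        print(f"{name}: mean = {mean(values):.1f}, variance = {pvariance(values):.1f}")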

This post was a bit more on the technical side, but I do hope folks enjoyed seeing the rigorous logic needed to prove that $N_3$ is the MVUE, as well as the heuristic observations / intuition relating to each step. I hope to have more content for you soon, and as always, I welcome any reader requests for topics. Until then, thanks for reading.


Sources


Wikipedia:
Law of Total Expectation
Rao-Blackwell Theorem
Lehmann-Scheffé Theorem
Note: I believe the proof of the Lehmann-Scheffé Theorem in the Wikipedia article is missing one step: namely, it proves that the candidate MVUE is the unique function of the complete-sufficient statistic that could be the MVUE, but it never proves that the MVUE must indeed be a function of the complete-sufficient statistic. The other sources below, as well as the proof I included above, remedy this omission. If I'm missing something in the Wikipedia proof, I'd appreciate it if someone would let me know in the comments.

FSU lecture notes (Debdeep Pati):
Complete Statistics
Note: These lecture notes contain numerous examples in which changing the parameter space $\Theta$ changes a statistic's completeness status. I promised above that I would include a source with such examples, since I did not include any in the post itself.

Stack Exchange:
A thread on the intuition behind the definition of completeness
Note: this thread also contains an example of a change in the completeness status of a statistic based on a change in $\Theta$.

OU lecture notes (Christian Remling):
Sufficiency
Completeness and Sufficiency
These lecture notes offer a thorough treatment of the topics with numerous examples other than just the German Tank Problem (e.g. the coin flip model). The notes also contain the continuous analog of the GTP, in which proving completeness and sufficiency of the sample max requires calculus.

Arizona lecture notes (Joe Watkins):
Sufficient Statistics
Another thorough treatment; I found Watkins's notation for the pmf's the easiest to follow, and thus adopted that notation throughout this post. I also followed his proof of the Fisher-Neyman Factorization Theorem, with a bit more explanation added in for my readers.

Oxford lecture notes (Steffen Lauritzen):
Sufficiency and Unbiased Estimation
In reaching the results of the Lehmann-Scheffé Theorem, these notes squeeze as much as possible out of sufficiency before finally invoking completeness. I used Lauritzen's proofs of (1) the Rao-Blackwell Theorem, (2) the essential uniqueness of the MVUE, and (3) the fact that the MVUE must be a function of a sufficient statistic (the latter being rolled up into my proof of the Lehmann-Scheffé Theorem, which is the same way Remling handled it).