
Parameter Estimation - Part 3 (Minimum-Variance Unbiased Estimator)

Preliminaries: How do mathematicians model randomness?, Parameter Estimation - Part 1 (Reader Request), Parameter Estimation - Part 2 (German Tank Problem)

First off, a housekeeping note: apologies to all for the long delay between the last post and this one (over a year...oops!). Each post does take me a while behind the scenes, and work plus other non-math activities (mainly history stuff) partly deprived me of the time I needed to put this together. That being said, I will try to keep that gap shorter going forward. Now, on to the actual content.

In Part 2, we examined various unbiased estimators of the size of a population based on a random sample of serial numbers, concluding that the estimator $N_3 = m + (\frac{m}{k} - 1)$ (where $x_k = m$ is the maximum serial number observed in the sample, and $k$ is the sample size) had lower variance than the others. I also made the claim that this estimator has the lowest variance of any unbiased estimator of the population size. In this Part 3, I will introduce the concepts of completeness and sufficiency of an estimator, as well as the theorems that allow us to use these properties to find the minimum-variance unbiased estimator (and prove it is unique).


Statistics/parameters recap; some new notation


We saw in Parts 1 and 2 that we can use an observable statistic or estimator, associated with a sample, to estimate an unobservable parameter, associated with the entire population. For example, we used formulas based on the sample maximum, $x_k = m$, to estimate the population size/maximum, $N$, in the German Tank Problem. The values of the statistic have a probability distribution, called the sampling distribution of the statistic, as different samples of size $k$ will yield different values of the estimator.

If the mean (a.k.a. expected value) of the sampling distribution of a statistic equals the value of the parameter, then the statistic is unbiased; otherwise, it's biased. Furthermore, there is also a variance associated with the estimation. Now, there can be a trade-off between bias and variance (i.e. we could accept an estimator we know to be slightly biased for a significant reduction of variance vs. an unbiased alternative), a concept I did not delve into in the past posts. Among unbiased estimators though, we would prefer one with as little variance as possible. As the title suggests, we'll continue to focus only on unbiased estimators in this post.

In Part 2, we saw that the sampling distribution of the sample maximum was $$
\Bbb{P}[x_k = m] = f(m) = \frac{m-1 \choose k-1}{N \choose k}
$$ which obviously depends on the value of the parameter, $N$. The probability distribution would be a different one for a different value of the parameter. Thus, we have here a statistical model for randomly choosing a sample of $k$ serial numbers, without replacement, from the set $\lbrace 1, 2, \dots , N \rbrace$ (with all ${N \choose k}$ possible samples equally likely).

The definitions in this post will require us to be a bit more rigorous than before, thinking of our statistical model as a family of probability distributions $\Bbb{P}_\theta$, parameterized, or indexed, by the value of a parameter $\theta$ (a common letter to use for a parameter in this context). In this post, I will use the notations $\Bbb{P}_\theta$ and $\Bbb{E}_\theta$ to denote the probability and expected value, respectively, for a particular value, $\theta$, of the relevant parameter within the model in question. I'll call the set of possible parameter values $\Theta$, and thus I'll refer to the model itself as $\scr{P} = \lbrace {\Bbb P}_\theta \, | \, \theta \in \Theta \rbrace$.

We'll also need the concept of conditional probability. The probability that $A$ will occur, given that $B$ has occurred, is denoted ${\Bbb P}[A|B]$ (read "probability of $A$, given $B$"). In general, the probability that $A$ and $B$ occur is ${\Bbb P}[A \cap B] = {\Bbb P}[A|B]{\Bbb P}[B] \, (\clubsuit)$. We can see that this equality is true by rearranging as ${\Bbb P}[A|B]={\Bbb P}[A \cap B] / {\Bbb P}[B]$: the probability of $A$, given $B$, is the probability of $A$'s intersection with $B$, divided by the probability of $B$ itself, i.e. the probability of the cases of interest divided by the total probability of the cases still possible once we know $B$ has occurred.

Finally, we'll need the law of total expectation: ${\Bbb E}[{\Bbb E}[A | B]]={\Bbb E}[A]$. Note that ${\Bbb E}[A|B]$ is a function of $B$, $f(b)={\Bbb E}[A \, | \, B=b]$; the outer expectation then averages over the possible values $b$ of $B$.

That the law of total expectation is true, is illustrated by a simple example from the Wikipedia article (I will include the link in Sources below): if factory $X$ makes 60% of light bulbs at your hardware store, and $X$'s light bulbs last 5,000 hours on average, and factory $Y$ makes the other 40%, with $Y$'s bulbs lasting 4,000 hours on average, then the expected lifetime $L$ of a random bulb purchased would be: $$
{\Bbb E}[L] =
\underbrace{5{,}000 \times .6}_{{\Bbb E}[L|X] \times {\Bbb P}[X]}
+
\underbrace{4{,}000 \times .4}_{{\Bbb E}[L|Y] \times {\Bbb P}[Y]}
= 4{,}600
$$ The formal proof of the law of total expectation just involves manipulations of a double-sum, so I won't include it here.
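To see the law in action numerically, here is a minimal simulation sketch of the light bulb example in C++ (the language I use for the simulations on this blog). The exponential lifetime distributions are purely an illustrative assumption on my part; the law of total expectation cares only about the conditional means.

```cpp
#include <iostream>
#include <random>

// Monte Carlo check of the law of total expectation for the light bulb
// example: E[L] = E[L|X] P[X] + E[L|Y] P[Y] = 5,000(0.6) + 4,000(0.4) = 4,600.
// The exponential lifetimes are an illustrative assumption; only the
// conditional means matter for the law itself.
int main() {
    std::mt19937 rng(42);
    std::bernoulli_distribution from_factory_x(0.6);             // P[X] = 0.6
    std::exponential_distribution<double> life_x(1.0 / 5000.0);  // mean 5,000
    std::exponential_distribution<double> life_y(1.0 / 4000.0);  // mean 4,000

    const int trials = 1000000;
    double total = 0.0;
    for (int i = 0; i < trials; ++i)
        total += from_factory_x(rng) ? life_x(rng) : life_y(rng);

    std::cout << "Simulated E[L]: " << total / trials << " (theory: 4600)\n";
    return 0;
}
```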


Complete statistics


Let $X$ be a random variable within a parametric model $\scr{P}$, and let ${\mathbf x} = (x_1, x_2, \dotsc, x_k)$ be a sample of values of $X$. Let $T$ be a statistic, i.e. a function of $\mathbf x$ (thus, below, $T$ will be understood to mean $T({\mathbf x})$).

The statistic $T$ is called complete if, for any (measurable*) function $g$, $$
{\Bbb E}_\theta \left[ g(T) \right] = 0, \forall \theta \in \Theta
\implies
{\Bbb P}_\theta \left[ g(T)=0 \right] = 1, \forall \theta \in \Theta
$$ This is a bulky definition, so let's unpack it. The $\implies$ arrow means that if the statement on the left of the arrow is true, then the statement on the right of the arrow is also true.

The statement on the left is saying that, for every value $\theta$ of the parameter, the expected value of $g(T)$ (under the specific version of the model with $\theta$ as the parameter) is zero, for a (measurable) function $g$.

The statement on the right is saying that, for every value $\theta$ of the parameter, the probability (under the specific version of the model with $\theta$ as the parameter) that $g(T)=0$ is 1, which means that $g$ is the zero function almost everywhere. This means that the set $\lbrace {\bf x} \, | \, g(T({\bf x})) \neq 0 \rbrace$ has probability zero (though it is not necessarily the empty set, an important distinction). When this is the case, $g$ is, for probabilistic purposes, essentially the same as the zero function.

*Note: "measurable" has a specific meaning beyond the scope of this post, but this basically just serves to rule out ridiculous functions. For all intents and purposes, any function you can think of is measurable.

$g(T)$ in this definition should be thought of as an estimator (for some parameter) based on the statistic $T$. In the German Tank Problem example, $T$ could be the max function $T({\mathbf x})=\max (x_i) =m$, and $g(T)$ would be some function of that. If ${\Bbb E}_\theta \left[ g(T) \right] = 0$ for every $\theta \in \Theta$, we call $g(T)$ an unbiased estimator of zero.

So this definition of completeness is really saying that $T$ is a complete statistic if the only way to obtain an unbiased estimator of zero from $T$ is to apply an (almost-everywhere) zero function to $T$. Equivalently, no function of $T$ that is not a.e. zero can be an unbiased estimator of zero.

Note that whether a statistic is complete or not depends on the model: the "$\forall \theta \in \Theta$" part of the definition means that completeness may depend on which values of $\theta$ are, or are not, permitted in the model. For this reason, it is technically more correct to say that the family of probability distributions ${\scr F}_T = \lbrace f_T(t|\theta) \, | \, \theta \in \Theta \rbrace$ is complete, where the $f_T(t| \theta)$ are of course the probability distributions of $T$ for different values of the parameter. However, it is also common to simply say that "the statistic $T$ is complete," and I will use the simpler terminology for the remainder of this post. It is not too difficult to come up with examples in which placing a restriction on $\Theta$ changes the completeness status of a statistic; I'll provide links in Sources that include a few such examples, but in the interest of post length, I won't go into one here.

This rather technical definition of completeness serves an important purpose. As the following Proposition shows, the property of completeness, as defined, guarantees the uniqueness of unbiased estimators.

Proposition 1: If a statistic $T$ is complete, then any unbiased estimator of a parameter $\theta$, based on $T$ (i.e. any estimator $g(T)$ that is a function of $T$), is essentially unique.

More precisely, the term essentially unique means that, if there is another function $h(T)$ that is an unbiased estimator of $\theta$, then $h(T)=g(T)$ almost everywhere. In other words, the set of values of $T$, for which $h \neq g$, has probability zero: ${\Bbb P}\left[ \lbrace t=T({\bf x}) \, | \, h(t) \neq g(t) \rbrace \right]=0$.

Proof: Suppose $g_1(T)$ and $g_2(T)$ are two estimators based on $T$ that both have expected value $\theta$, i.e. both are unbiased estimators of the parameter $\theta$. Then the function $g(T)=g_1(T)-g_2(T)$ is an unbiased estimator of zero: $$
{\Bbb E}[g(T)] = {\Bbb E}[g_1(T)] - {\Bbb E}[g_2(T)] = \theta - \theta = 0
$$ Since $T$ is complete, $g = g_1 - g_2$ must be the zero function almost everywhere, i.e. $g_1(T)=g_2(T)$ almost everywhere. Thus, the unbiased estimator of $\theta$ based on $T$ is essentially unique. $\square$


The sample max in the German Tank Problem is a complete statistic


Because $\Bbb{P}_\theta [x_k = m] = {m-1 \choose k-1}/{\theta \choose k}$ (here, we've replaced the population maximum $N$ with the letter $\theta$ in keeping with the notation from earlier in this post), we also know that, for a function $g$, and using $T$ as the sample maximum statistic: $$
{\Bbb E}_\theta[g(T)] = \sum_{j=k}^{\theta}{g(j) {\Bbb P}_\theta[T=j]} = \sum_{j=k}^{\theta}{g(j) \frac{j-1 \choose k-1}{\theta \choose k}}
$$ Suppose that the above expected value is $0$ for all permitted values of the parameter $\theta$, which, in this model, are $\lbrace k, k+1, k+2, \dotsc \rbrace$ (we need $\theta \geq k$ in order to draw a sample of size $k$ without replacement). Then, in particular, it is $0$ for $\theta = k$. In that simple case, the sum has only one term, $j=k$, and since the fraction becomes ${k-1 \choose k-1} / {k \choose k}=1/1=1$, we must have $g(k)=0$ in order for the whole sum to equal zero.

The expectation must also be $0$ for the next case, $\theta=k+1$. Plugging in $g(k)=0$ from above, we are left with only one term in the two-term sum, namely $g(k+1){k \choose k-1}/{k+1 \choose k}$. Since this fraction of binomial coefficients is again non-zero, we must have $g(k+1)=0$ by the same reasoning as above.

Continuing with this inductive argument, we obtain $g(j)=0$ for all permitted values of $j$ (i.e. $j \geq k$), so $g$ must be the zero function in order to have ${\Bbb E}_\theta[g(T)]=0, \forall \theta \in \Theta$. This proves that the sample maximum $T$ is a complete statistic with respect to the model ${\scr P}$.
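The inductive argument above is really just forward substitution on a triangular system of linear equations. As a sanity check, here is a minimal sketch (the sample size $k=3$ and cutoff $M=12$ are illustrative choices) that imposes ${\Bbb E}_\theta[g(T)]=0$ for $\theta = k, k+1, \dotsc, M$ and solves for $g(k), \dotsc, g(M)$ one at a time. The non-zero "diagonal" coefficient ${\theta - 1 \choose k-1}$ is exactly what forces each $g(\theta)$ to vanish in turn, and every printed value is $0$.

```cpp
#include <iostream>
#include <vector>

// Binomial coefficient as a double (fine for the small values used here).
double binom(int n, int r) {
    if (r < 0 || r > n) return 0.0;
    double result = 1.0;
    for (int i = 0; i < r; ++i) result = result * (n - i) / (i + 1);
    return result;
}

// Impose E_theta[g(T)] = sum_{j=k}^{theta} g(j) C(j-1,k-1)/C(theta,k) = 0
// for theta = k..M and solve by forward substitution. (The common factor
// 1/C(theta,k) cancels, so it is omitted.) The coefficient of g(theta) is
// C(theta-1,k-1) > 0, so each g(theta) is forced to be zero.
int main() {
    const int k = 3, M = 12;  // illustrative sample size and cutoff
    std::vector<double> g(M + 1, 0.0);
    for (int theta = k; theta <= M; ++theta) {
        double partial = 0.0;  // contribution of the already-determined g(j)
        for (int j = k; j < theta; ++j) partial += g[j] * binom(j - 1, k - 1);
        g[theta] = (0.0 - partial) / binom(theta - 1, k - 1);
        std::cout << "g(" << theta << ") = " << g[theta] << "\n";  // prints 0
    }
    return 0;
}
```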


Sufficient statistics and the Factorization Theorem


Proposition 1 suggests that, if we can somehow use a complete statistic $T$ to derive a minimum-variance unbiased estimator, then that estimator will be unique. Sufficiency is the property we need in order to actually find an unbiased estimator with minimum variance.

An estimator has variance because different samples yield different values of the estimator. Furthermore, some information in the samples is relevant to the estimation of the parameter, while other information is not. For example, from Part 2, while common sense dictates that the sample maximum in the German Tank Problem should certainly be relevant in estimating the population size, the estimator $N_1=2{\bar x}-1$ is based on the sample mean. Similarly, estimator $N_2=m+(x_1-1)$, though doing better since it's based on $m$, includes a "gap" adjustment based on the lowest serial number in the sample. These facts about the sample are "noise": they don't help us estimate $N$ more accurately.

Indeed, for both $N_1$ and $N_2$, different samples, all of size $k$ and with the same sample maximum $m$, can yield different estimator values. $N_3 = m + (\frac{m}{k}-1)$, on the other hand, depends only on the sample max $m$ and sample size $k$ (recall from Part 2 that the "average gap" adjustment simplified down to depend only on these). While different samples will have different values of $m$, yielding some estimator variance that is unavoidable, if we are given a certain value of $m$, we can have only one value of the estimator $N_3$. Thus, $N_3$ has less "avoidable noise" than the other estimators and should be the least volatile.

The discussion above is the intuition behind the following definition: a statistic $T({\bf x})$ is called sufficient (for the parameter $\theta$) if the conditional probability ${\Bbb P}_\theta[X={\bf x} \, | \, T({\bf x})=t]$ does not depend on $\theta$ (it is, of course, still a function of the sample ${\bf x}$ and of the value $t$ of the statistic).

If $T$ is a sufficient statistic, then the probability that the sample takes on a particular value ${\bf x}$ depends on $\theta$ only through the value $t$ of the statistic. Flipping this around, when we estimate the value of the unknown parameter $\theta$ based on the sample ${\bf x}$, then the value $t=T({\bf x})$ of the statistic already contains all the information from the sample that is relevant to estimating $\theta$. For this reason, a sufficient statistic $T$ is sometimes referred to as a lossless compression of the sample ${\bf x}$.

The following theorem characterizes sufficient statistics.

Fisher-Neyman Factorization Theorem: Let $f_X({\bf x}|\theta)$ be the probability distribution of $X$, given the parameter value $\theta$. A statistic $T(X)$ is sufficient for the parameter $\theta$ if and only if there exist functions $g$ and $h$ such that $$
f_X({\bf x}|\theta) = h({\bf x})g(\theta, T({\bf x}))
$$ In other words, the probability density function of $X$ can be factored into a product such that one factor, $h$, does not depend on $\theta$, and the other factor, which does depend on $\theta$, depends on the sample ${\bf x}$ only through the statistic $t=T({\bf x})$.

This theorem makes some intuitive sense given the discussion above; a thorough reading of the proof helped me understand the symbols in terms of this intuition (and vice-versa). Note that this proof only works for the case of discrete random variables, but the theorem also applies in the case of continuous r.v.'s.

Proof:
Sufficiency $\implies$ factorization:
First, note the trivial fact that, since $T$ is a function of $X$, ($T=T({\bf x})$ and $X={\bf x}$) if and only if $X={\bf x}$. This means that ${\Bbb P}_\theta [X={\bf x}]={\Bbb P}_\theta [X={\bf x} \cap T=T({\bf x})]$, or in terms of the probability mass functions, $f_X({\bf x}|\theta)=f_{X,T(X)}({\bf x},T({\bf x})|\theta)$. Therefore: $$
\begin{align}

f_X({\bf x}|\theta)

&= f_{X,T(X)}({\bf x},T({\bf x})|\theta) \\[2mm]

&\stackrel{(\clubsuit)}{=}
\underbrace{f_{X|T(X)}({\bf x}|T({\bf x}),\theta) }_{h({\bf x})}
\cdot
\underbrace{f_{T({\bf x})}(T({\bf x})|\theta)}_{g(\theta,T({\bf x}))}

\end{align}
$$ The second equality is just equation $(\clubsuit)$ from the intro, written in terms of the pmf's rather than the probabilities. Since $T$ is sufficient, the conditional pmf $f_{X|T(X)}({\bf x}|T({\bf x}))$ does not depend on $\theta$; it is a function of ${\bf x}$ alone (through ${\bf x}$ itself and $t=T({\bf x})$), so we can write it as $h({\bf x})$. The second pmf, $f_{T({\bf x})}(T({\bf x})|\theta)$, depends only on $t=T({\bf x})$ and $\theta$, so we can write it as $g(\theta,T({\bf x}))$, yielding the desired factorization formula.

Factorization $\implies$ sufficiency:
We turn again to the pmf version of $(\clubsuit)$: $$
f_{X|T(X)}({\bf x}|t,\theta)=\frac{f_{X,T(X)}({\bf x},t|\theta)}{f_{T(X)}(t|\theta)} \, \, \, \, (1)
$$ The numerator on the right-hand side is equal to $f_X({\bf x}|\theta)$ if $t=T({\bf x})$, and zero if $t \neq T({\bf x})$, again since $T$ is a function of $X$. Assuming the factorization formula, $f_X({\bf x}|\theta) \stackrel{(2)}{=} h({\bf x})g(\theta,t)$ when $t=T({\bf x})$.

For the denominator, since $T$ is a function of $X$, we have ${\Bbb P}_\theta[T=t]=\sum{\Bbb P}_\theta[X={\bf \tilde x}]$, where the sum is taken over the values of the dummy variable ${\bf \tilde x}$ for which $T({\bf \tilde x})=t$. Thus: $$
f_{T(X)}(t|\theta)
= \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{f_X({\bf \tilde x}|\theta)}
\stackrel{(3)}{=} \sum_{{\bf \tilde x}: T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}
$$ where the first equality is the pmf version of the probability statement above, and the second equality follows from the assumption of the factorization formula for $f_X({\bf x}|\theta)$.

Plugging $(2)$ and $(3)$ into $(1)$ yields: $$
\begin{align}
f_{X|T(X)}({\bf x}|t,\theta)

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x})g(\theta,t) \right]}} \\[2mm]

&=\frac
{h({\bf x})g(\theta,t)}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]g(\theta,t)}}

=\frac
{h({\bf x})}
{\sum_{{\bf \tilde x}:T({\bf \tilde x})=t}{\left[ h({\bf \tilde x}) \right]}}
\end{align}
$$ where we took the $g(\theta,t)$ out of the sum in the second equality since it doesn't depend on ${\bf \tilde x}$. Since the final expression does not depend on $\theta$, we have proved that $T$ is sufficient.
$\square$


The sample max in the German Tank Problem is a sufficient statistic


To show that the sample max is a sufficient statistic, we can use the Factorization Theorem above. Let ${\bf 1}_A$ be the function that equals $1$ if $A$ is true and $0$ otherwise (this is called the indicator function of event $A$), and, for a sample ${\bf x}=(x_1,x_2,\dotsc,x_k)$, let the statistic $T({\bf x})=\max(x_i)$, the sample max.

The probability mass function for the sample $X$ taking the (ordered) value ${\bf x}=(x_1, x_2, \dotsc, x_k)$, given the population size is $\theta$, is: $$
\begin{align}
f_X({\bf x} \, | \, \theta)

&= \frac{1}{\theta \choose k} {\bf 1}_{\lbrace 1 \leq x_1 < x_2 < \dotsb < x_k \leq \theta \rbrace} \\[2mm]

&= \underbrace{
\vphantom{\frac{1}{\theta \choose k}}
{\bf 1}_{\lbrace 1 \leq x_1 < x_2 < \dotsb < x_k \rbrace}
}_{h({\bf x})}
\,
\underbrace{\frac{1}{\theta \choose k} {\bf 1}_{\lbrace T({\bf x}) \leq \theta \rbrace}}_{g(\theta,T({\bf x}))}

\end{align}
$$ since all ${\theta \choose k}$ possible samples are equally likely, and the condition $x_k \leq \theta$ is exactly the condition $T({\bf x}) \leq \theta$. The first factor does not involve $\theta$, while the second involves the sample only through $T({\bf x})$. Thus, by the Factorization Theorem, the sample max $T(X)$ is a sufficient statistic for the population size $\theta$.
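We can also watch the definition of sufficiency at work by brute force. The sketch below (the sample size $k=2$, the conditioning value $m=4$, and the two population sizes are illustrative choices) enumerates all equally likely samples and shows that the conditional probability of each sample, given the max, is the same regardless of $\theta$.

```cpp
#include <iostream>

// For each population size theta, enumerate all equally likely samples (a, b)
// with 1 <= a < b <= theta, then condition on the sample max being m = 4.
// Sufficiency of the max says the conditional probability of each such sample
// does not depend on theta; here it is 1/3 in both cases.
int main() {
    const int m = 4;            // condition on T(x) = m
    for (int theta : {5, 8}) {  // two illustrative population sizes
        int total = 0, with_max_m = 0;
        for (int a = 1; a <= theta; ++a)
            for (int b = a + 1; b <= theta; ++b) {
                ++total;        // each sample has probability 1/total
                if (b == m) ++with_max_m;
            }
        // P[sample | T = m] = (1/total) / (with_max_m/total) = 1/with_max_m
        std::cout << "theta = " << theta << ": " << total << " samples, "
                  << with_max_m << " with max " << m
                  << "; each has conditional probability 1/" << with_max_m
                  << "\n";
    }
    return 0;
}
```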


The Rao-Blackwell and Lehmann-Scheffé Theorems


The discussion above suggests that completeness and sufficiency should help us obtain a unique estimator with minimal variance. The Rao-Blackwell and Lehmann-Scheffé Theorems prove this definitively. The first of these two gives us a process to reduce the variance of an estimator:

Clockwise from top-left: Rao, Blackwell, Scheffé, Lehmann
All four images courtesy of Wikipedia.

Rao-Blackwell Theorem:
Let $U=u(X)$ be an unbiased estimator of $\theta$, and let $T=t(X)$ be a sufficient statistic for $\theta$. Then the Rao-Blackwell estimator $$
U^*=u^*({\bf x}) = {\Bbb E}_{\theta}[U \, | \, T=t({\bf x})]
is also unbiased, and its variance is no greater than that of $U$.

In the proof below, all ${\Bbb E}$'s are understood to stand for ${\Bbb E}_{\theta}$.

Proof: By the definition of sufficiency given above, $u^*({\bf x})$ is indeed a function of ${\bf x}$, since the conditional expectation of $U$, given the sufficient statistic $T$, does not depend on $\theta$. Also, $U^*$ is unbiased by the law of total expectation and the fact that $U$ is unbiased: $$
{\Bbb E}[U^*] = {\Bbb E}[{\Bbb E}[U|T]] = {\Bbb E}[U] = \theta
$$ Furthermore, the variance of $U^*$ is no greater than that of $U$: $$
\begin{align}
{\rm var}[U^*]

&= {\Bbb E}\left[ \left( U^* - {\Bbb E}[U^*] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right]  \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U|T] - \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\Bbb E}\left[ \left( {\Bbb E}[U-  \theta \, | \, T] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{3}\\[2mm]


&\leq {\Bbb E}\left[ {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \, \middle| \, T \right] \right] \tag{4} \\[2mm]

&= {\Bbb E}\left[ \left( U-  \theta \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \tag{5} \\[2mm]

&= {\Bbb E}\left[ \left( U-  {\Bbb E}[U] \vphantom{{\Bbb E}[U^*]^2} \right)^2 \right] \\[2mm]

&= {\rm var}[U]

\end{align}
$$ The third equality is true because $\theta$ is a constant (and ${\Bbb E}$ is linear), the inequality in line 4 is due to Jensen's inequality, and the equality in line 5 is the law of total expectation. $\square$

Rao-Blackwell is telling us that if we supply an unbiased estimator, we can condition on a sufficient statistic to obtain a new unbiased estimator with lower variance. This sounds like a good process to source MVUE candidates. The Lehmann-Scheffé Theorem is the last piece of the puzzle, and we need one more Proposition to prove it.

Proposition 2: If an MVUE exists, it is essentially unique. That is, if $U$ and $U'$ are two MVUE's, then $U=U'$ almost everywhere.

Proof: For random variables $X$ and $Y$, and constants $c, c_1, c_2$, we have the following variance and covariance formulas (which I won't derive in detail here): $$
\begin{align}
{\rm var}[X+Y] &= {\rm var}[X]+{\rm var}[Y] + 2{\rm cov}(X,Y) \tag{1}\\[2mm]
{\rm var}[cY] &= c^2 {\rm var}[Y] \tag{2} \\[2mm]
{\rm cov}(c_1 X,c_2 Y) &= c_1 c_2 {\rm cov}(X,Y) \tag{3}\\
\end{align}
$$ So, let $U$ be an MVUE and $U'$ another MVUE. Since $U$ and $U'$ both have expected value $\theta$, their difference has expected value zero. Thus, if we can show that ${\rm var}[U'-U]=0$, that would mean that $U'-U$ is constant almost everywhere; since its expected value is zero, that constant must be zero, and thus $U'=U$ almost everywhere. Using the formulas above, for any constant $c$, we have: $$
{\rm var}[U+c(U'-U)] = {\rm var}[U] + c^2 {\rm var}[U'-U] + 2c \, {\rm cov}(U, U'-U) \, \,(\spadesuit)
$$ Assume ${\rm var}[U'-U] = \gamma >0$ (we will show this leads to a contradiction), and let $\rho = -{\rm cov}(U,U'-U)$. Equation $(\spadesuit)$ must be true for any value of $c$, so let's plug in $c=\rho / \gamma$: $$
\begin{align}
{\rm var}[U+c(U'-U)]

&= {\rm var}[U] + \left( \frac{\rho}{\gamma} \right)^2 \gamma -2\left( \frac{\rho}{\gamma} \right) \rho \\[2mm]

&= {\rm var}[U] - \rho^2 / \gamma
\end{align}
$$ Note that $U+c(U'-U)$ is itself an unbiased estimator of $\theta$, since ${\Bbb E}[U'-U]=0$. Since $U$ is an MVUE, the variance of $U+c(U'-U)$ therefore can't be less than that of $U$. Thus, since $\rho^2 / \gamma \geq 0$, we must have $\rho^2 / \gamma = 0 \implies \rho = -{\rm cov}(U,U'-U)=0$.

Plugging this into $(\spadesuit)$, and now using the value $c=1$, we obtain: $$
\begin{align}
{\rm var}[U+1 \cdot (U'-U)] &= {\rm var}[U] + 1^2 \cdot {\rm var}[U'-U] + 0 \\[2mm]
{\rm var}[U'] &= {\rm var}[U] + \gamma
\end{align}
$$ But $U'$ is an MVUE, so its variance cannot be greater than that of $U$, forcing $\gamma \leq 0$ and contradicting our assumption that $\gamma > 0$. We must therefore have ${\rm var}[U'-U]=0$, and thus $U'=U$ almost everywhere. $\square$

The proof of the following theorem will now be nice and easy, based on the discussion above.

Lehmann-Scheffé Theorem: If $T$ is a complete and sufficient statistic for a parameter $\theta$, and $U$ is an unbiased estimator of $\theta$, then $U^*={\Bbb E}[U|T]$ is the unique MVUE for $\theta$.

Proof: Firstly, let's break down the notation. Recall from the intro section that $U^*={\Bbb E}[U|T]$ is indeed a function of the value of $T$. Indeed, if ${\bf x}$ is the value of our sample, then the statistic $T$ has the value $t=T({\bf x})$, and $U^*({\bf x}) = g(t) = {\Bbb E}[U \, | \, T({\bf x})=t]$. Recall that we can write this in the simpler form $U^*=g(T)$, with the dependence on the sample ${\bf x}$ implied.

By Rao-Blackwell, we know that (1) $U^*$ is unbiased (since $U$ is unbiased, and $T$ is sufficient), and (2) ${\rm var}[U^*] \leq {\rm var}[U]$. Thus, $U^*$ is an MVUE candidate. Now, let $V$ be any unbiased estimator of $\theta$. We will prove that ${\rm var}[U^*] \leq {\rm var}[V]$, which establishes that $U^*$ is an MVUE.

Define $V^* = {\Bbb E}[V|T] := h(T)$. Also by Rao-Blackwell, $V^*$ is unbiased and has variance less than or equal to that of $V$.

Furthermore, by Proposition 1, any unbiased estimator that is a function of the complete statistic $T$ is essentially unique. Thus, the following is true (almost everywhere): $$
V^* = h(T) = g(T) = U^*
$$ and so ${\rm var}[U^*] = {\rm var}[V^*] \leq {\rm var}[V]$ for every unbiased estimator $V$, i.e. $U^*$ is an MVUE. By Proposition 2, the MVUE is essentially unique, so $U^*$ is the unique MVUE of $\theta$. $\square$

All of the above theorems and propositions allow us to conclude that all of the following are true:

  • If we can find an unbiased estimator $U$ of $\theta$ as well as a complete and sufficient statistic $T$, then the Rao-Blackwell estimator $U^* = {\Bbb E}[U|T]$ is the MVUE of $\theta$ (Lehmann-Scheffé). The MVUE is a function of $T$.
  • This function of $T$ is essentially unique (Proposition 1).
  • An MVUE is essentially unique (Proposition 2). This means that, regardless of which complete and sufficient statistic $T$ we use to Rao-Blackwellize, we will obtain the same MVUE.
  • Similarly, we will obtain the same MVUE regardless of which unbiased estimator $U$ we start with.
The various sources listed below all come to these conclusions, often in different orders or stating some a bit differently. In Sources below, I include some commentary on the approaches of the respective authors.

Finally, turning back to the unbiased estimator $N_3$ from the German Tank Problem, it is now easy to see why it is the MVUE. We proved above that $T({\bf x})=\max(x_i)=m$ is complete and sufficient. As $N_3$ only depends on $m$, conditioning on $T$ obviously adds no additional information, so $N_3^* = {\Bbb E}[N_3 \, | \, T = m] = N_3$. By the Lehmann-Scheffé Theorem, $N_3$ is thus the unique MVUE for the population size $\theta$. An even simpler explanation is that $N_3$ is an unbiased function of the complete, sufficient statistic $T$. Since the MVUE is the unique unbiased function of $T$, $N_3$ is the MVUE.
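To make the Rao-Blackwellization concrete, here is a minimal sketch that starts from the cruder unbiased estimator $N_2 = m + (x_1 - 1)$ and computes ${\Bbb E}[N_2 \, | \, T=m]$ by enumerating every sample with a given max (the values $k=3$ and $m=10$ are illustrative). Both printed values equal $10 + \frac{10}{3} - 1 = 12.33\overline{3}$: conditioning $N_2$ on the sufficient statistic lands exactly on $N_3$, just as Lehmann-Scheffé predicts.

```cpp
#include <iostream>

// Rao-Blackwellize N2 = m + (x1 - 1) in the German Tank Problem: given
// sample max T = m and sample size k = 3, the two non-max values are a
// uniformly random pair from {1, ..., m-1}. Averaging N2 over all such
// samples gives E[N2 | T = m], which matches N3 = m + (m/k - 1).
int main() {
    const int k = 3, m = 10;  // illustrative sample size and observed max
    double sum_n2 = 0.0;
    int count = 0;
    for (int a = 1; a < m; ++a)        // a < b < m are the non-max values
        for (int b = a + 1; b < m; ++b) {
            sum_n2 += m + (a - 1);     // N2 for this sample (a is the min)
            ++count;
        }
    double n3 = m + (static_cast<double>(m) / k - 1);
    std::cout << "E[N2 | T = m] = " << sum_n2 / count << "\n";  // 12.3333
    std::cout << "N3            = " << n3 << "\n";              // 12.3333
    return 0;
}
```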

This post was a bit more on the technical side, but I do hope folks enjoyed seeing the rigorous logic needed to prove that $N_3$ is the MVUE, as well as the heuristic observations / intuition relating to each step. I hope to have more content for you soon, and as always, I welcome any reader requests for topics. Until then, thanks for reading.


Sources


Wikipedia:
Law of Total Expectation
Rao-Blackwell Theorem
Lehmann-Scheffé Theorem
Note: I believe the proof of the Lehmann-Scheffé Theorem in the Wikipedia is missing one step: namely, it proves that the candidate MVUE is the unique function of the complete-sufficient statistic that could be the MVUE, but it never proves that the MVUE must indeed be a function of the complete-sufficient statistic. The other sources below, as well as the proof I included above, remedy this omission. If I'm missing something in the Wikipedia proof, I'd appreciate if someone would let me know in the comments.

FSU lecture notes (Debdeep Pati):
Complete Statistics
Note: These lecture notes contain numerous examples in which changing the parameter space $\Theta$ changes a statistic's completeness status. I promised above that I would include a source with such examples, since I did not include any in the post itself.

Stack Exchange:
A thread on the intuition behind the definition of completeness
Note: this thread also contains an example of a change in the completeness status of a statistic based on a change in $\Theta$.

OU lecture notes (Christian Remling):
Sufficiency
Completeness and Sufficiency
These lecture notes offer a thorough treatment of the topics with numerous examples other than just the German Tank Problem (e.g. the coin flip model). The notes also contain the continuous analog of the GTP, in which proving completeness and sufficiency of the sample max requires calculus.

Arizona lecture notes (Joe Watkins):
Sufficient Statistics
Another thorough treatment. I found Watkins's notation for the pmf's easiest to follow in these notes and thus adopted that notation throughout this post. I also followed his proof of the Fisher-Neyman Factorization Theorem, with a bit more explanation added in for my readers.

Oxford lecture notes (Steffen Lauritzen):
Sufficiency and Unbiased Estimation
In reaching the results of the Lehmann-Scheffé Theorem, these notes squeeze as much as possible out of sufficiency before finally invoking completeness. I used Lauritzen's proofs of (1) the Rao-Blackwell Theorem, (2) the essential uniqueness of the MVUE, and (3) the fact that the MVUE must be a function of a sufficient statistic (the latter being rolled up into my proof of the Lehmann-Scheffé Theorem, which is the same way Remling handled it).

Parameter Estimation - Part 2 (German Tank Problem)

Preliminaries: Parameter Estimation - Part 1 (Reader Request)

Part 1 of this post introduced the concept of calculating statistics based on sample data and using them to estimate the unknown values of population parameters. As an estimator's predictive quality depends on both its bias and its variance, unbiasedness and low variance are clearly desirable qualities. In this post, we will derive and compare various estimators in the context of a concrete example, the German Tank Problem.


Historical background


During WWII, the Economic Warfare Division of the US Embassy in London utilized a statistical method to estimate the monthly production rate of German tanks (among other equipment). While standard intelligence estimates (based, for example, on widely divergent POW reports or projections based on pre-war production capacity) amounted to 1,400 tanks per month between June 1940 and September 1942, the statistical analysis yielded a far lower figure: 256 per month. After the war, captured government records revealed the actual figure to be 255.

This statistical analysis was based on serial number markings on parts of captured or destroyed tanks. For example, the serial numbers on various wheels yielded an estimate of the total number of wheel molds in use, and British road-wheel manufacturers provided data on the number of wheels that could be produced from a single mold. Thus, if there are 32 wheels per tank, the estimated number of molds in use, times the number of wheels produced per mold per month, divided by 32, would be an estimate for the number of tanks produced in a given month.

Analysis of the 64 wheels from two captured tanks produced an estimate of 270 tanks produced in February 1944, while government records indicated 276 tanks for that month. The statistical method was thus quite effective, even based on a relatively small number of captured tanks. The Allied statisticians also analyzed wheels, gearboxes, chassis, and engines in order to cross-check their results.

Ignoring the particulars of wheel molds and serial numbering conventions, the mathematical statement of the estimation problem now known as the German Tank Problem is quite simple: suppose we have a population of $N$ items with labels $1,2, \dotsc, N$. Given a random sample of $k$ items with labels $x_1, x_2, \dotsc, x_k$, how do we estimate the population size $N$?

We would like an estimator for $N$ which is unbiased and has low variance. Without loss of generality, we can assume that the labels are in order, i.e. $x_1 < x_2 < \dotsb < x_k$. Furthermore, we define the sample maximum $m := x_k$.


Sample maximum: a crude estimator


One obvious candidate for an estimator of $N$ is simply $m$. However, common sense dictates that this estimate must be biased downward, as $m$ cannot exceed $N$. Thus, the expected value of $m$ should be less than or equal to $N$.

Indeed, the probability of obtaining a sample with maximum $m$ is the number of ways to choose a sample with maximum $m$ divided by the number of possible samples. To choose a sample with maximum $m$, we must choose $m$ (there is only one way to do this) as well as $k-1$ additional sample values from the set $\lbrace 1, 2, \dotsc, m-1 \rbrace$ in one of ${m-1} \choose {k-1}$ ways, so the probability is $$
{\Bbb P}\left[ x_k = m \right] = \frac{{m-1}\choose{k-1}}{{N}\choose{k}}
$$ and the expected value of $m$ is: $$
\begin{align}
{\Bbb E}[m]
&= \sum_{j=k}^{N}{j \frac{{j-1}\choose{k-1}}{{N}\choose{k}}} \\[2mm]
&= \frac{1}{N \choose k} \sum_{j=k}^{N}{j \frac{(j-1)!}{((j-1)-(k-1))! (k-1)!}} \\[2mm]
&= \frac{1}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{\frac{j!}{(j-k)!}} \\[2mm]
&= \frac{k!}{(k-1)!{N \choose k}} \sum_{j=k}^{N}{j \choose k} \\[2mm]
&\buildrel{(*)} \over{=} k \frac{{N+1}\choose{k+1}}{N \choose k} \\[2mm]
&= \frac{k}{k+1}(N+1)
\end{align}
$$ which does not equal $N$, except when $k=N$. Thus, $m$ is a biased estimator of $N$. The bias does become small when the sample size is a significant proportion of $N$, but this is not usually the case in practice.

$(*)$ Note: this equality follows from the Hockey-stick Identity for binomial coefficients: $\sum_{j=k}^{N}{j \choose k} = {{N+1} \choose {k+1}}$

Expected value of $m$ for various values of sample size $k$ (with $N=250$).
Notice that the bias is sizable until $k$ is ~20-30, i.e. 8-12% of $N$.
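As a numerical check on the derivation, the following sketch recomputes ${\Bbb E}[m]$ directly from the pmf for $N=250$ (the value used in the chart above) and a few illustrative sample sizes, and compares it against the closed form $\frac{k}{k+1}(N+1)$.

```cpp
#include <iostream>

// Binomial coefficient as a double (adequate for these magnitudes).
double binom(int n, int r) {
    if (r < 0 || r > n) return 0.0;
    double result = 1.0;
    for (int i = 0; i < r; ++i) result = result * (n - i) / (i + 1);
    return result;
}

// Exact computation of E[m] = sum_{j=k}^{N} j * C(j-1,k-1)/C(N,k),
// compared against the closed form k(N+1)/(k+1).
int main() {
    const int N = 250;
    for (int k : {2, 5, 10, 20, 30}) {
        double e_m = 0.0;
        for (int j = k; j <= N; ++j)
            e_m += j * binom(j - 1, k - 1) / binom(N, k);
        double closed_form = static_cast<double>(k) * (N + 1) / (k + 1);
        std::cout << "k = " << k << ": E[m] = " << e_m
                  << ", k(N+1)/(k+1) = " << closed_form << "\n";
    }
    return 0;
}
```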


A few unbiased estimators


We can employ a few common sense observations to derive unbiased estimators of $N$.

Firstly, let $\psi$ be the middle value of the list $1, 2, \dotsc, N$, i.e. $\psi = \frac{1}{2}(N+1)$ (this formula works whether $N$ is odd or even). Then $N=2\psi-1$, and the sample mean $\overline{x} = \frac{1}{k}\sum_{i=1}^{k}{x_i}$ is an unbiased estimator of $\psi$ (by linearity of expectation: each $x_i$, viewed before sorting, is uniform on $\lbrace 1, 2, \dotsc, N \rbrace$ and so has mean $\psi$). This, in turn, implies that $N_1 := 2\overline{x}-1$ is an unbiased estimator of $N$.

The issue with $N_1$ is that it can be less than the sample maximum $m$. For example, if we have a sample of size $k=3$ with $x_1 = 2, x_2 = 3, x_3=10$, then $\overline{x}=5$, and $N_1=9$, but $m=10$. This is an undesirable estimator as we know for sure that $N \geq 10$.

A more sophisticated method for deriving estimates revolves around analysis of the expected "gaps" between observations in the sample. For example, we may expect that, on average, the sample observations $x_1, x_2, \dotsc, x_k$ will be distributed roughly evenly throughout the interval $1, 2, \dotsc, N$. Therefore, we expect the number of unobserved labels below $x_1$ (which is observable from our sample, as the lowest label is $1$) to be equal, on average, to the number of unobserved labels above $m=x_k$ (which is not observable, as $N$ is unknown). Thus, we define our second estimator to be $N_2 :=  m + (x_1 - 1)$.

We can further refine $N_2$ by replacing the "gap" $(x_1 - 1)$ with the average of all the gaps in our sample, thus more fully utilizing the sample data. The average gap size in the sample is $$
\begin{align}
G_{\text{avg}}

&= \frac{1}{k} \left[ (x_1 - 1) + (x_2 - x_1 - 1)
+ \dotsb + (x_k - x_{k-1} - 1) \right] \\[2mm]

&= \frac{1}{k} \left[ x_k - k \right] \\[2mm]

&= \frac{m}{k}-1
\end{align}
$$ Consequently, we define our third estimator to be $N_3 := m + G_{\text{avg}} = m + \left( \frac{m}{k} - 1\right)$. Clearly, $N_2, N_3 \geq m$, so these estimators cannot exhibit the undesirable quality of taking values less than $m$.

Furthermore, $$
\begin{align}
{\Bbb E}[N_3]
&= {\Bbb E}\left[ m + \left( \frac{m}{k} - 1 \right) \right] \\[2mm]
&= {\Bbb E}\left[ m \left( 1+k^{-1} \right) - 1 \right] \\[2mm]
&= \left( 1+k^{-1} \right) {\Bbb E}\left[ m \right] - 1 \\[2mm]
&= \left( 1+k^{-1} \right) \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= \frac{k+1}{k} \cdot \frac{k}{k+1}(N+1) - 1 \\[2mm]
&= N
\end{align}
$$ so $N_3$ is an unbiased estimator. The proof that $N_2$ is also unbiased follows a similar line of reasoning.
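For readers who prefer an empirical check, here is a minimal simulation sketch (the population size, sample size, and trial count are illustrative) that draws many samples without replacement and confirms that $N_1$, $N_2$, and $N_3$ all average out to approximately $N$.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Draw many size-k samples without replacement from {1, ..., N} and check
// that the averages of N1 = 2*xbar - 1, N2 = m + (x1 - 1), and
// N3 = m + (m/k - 1) all come out near N. (Requires C++17 for std::sample.)
int main() {
    const int N = 250, k = 5, trials = 200000;
    std::vector<int> labels(N);
    std::iota(labels.begin(), labels.end(), 1);  // 1, 2, ..., N

    std::mt19937 rng(7);
    double sum1 = 0, sum2 = 0, sum3 = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<int> s;
        std::sample(labels.begin(), labels.end(), std::back_inserter(s), k, rng);
        double xbar = std::accumulate(s.begin(), s.end(), 0.0) / k;
        int x1 = *std::min_element(s.begin(), s.end());
        int m  = *std::max_element(s.begin(), s.end());
        sum1 += 2 * xbar - 1;
        sum2 += m + (x1 - 1);
        sum3 += m + (static_cast<double>(m) / k - 1);
    }
    std::cout << "mean N1 = " << sum1 / trials
              << ", mean N2 = " << sum2 / trials
              << ", mean N3 = " << sum3 / trials
              << " (target N = " << N << ")\n";
    return 0;
}
```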


Variance comparison


Below is a comparison of the variances of the three unbiased estimators $N_1, N_2, N_3$ defined above. As I intend this post to be more practical than technical, I will avoid a lengthy computation involving binomial coefficients and simply state the variances below without proof.

Note: the Wikipedia article on the Discrete uniform distribution contains the full derivation of the variance formula for $N_3$; this derivation uses the formulas ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2} {\rm Var}[m]$ and ${\rm Var}[m] = {\Bbb E}[m^2] - ({\Bbb E}[m])^2$. The computation of ${\Bbb E}[m^2]$ is, of course, the lengthy part but is ultimately similar to our derivation of ${\Bbb E}[m]$ above.

The variance formulas are:

  • ${\rm Var}[N_1] = \frac{(k+2)}{3k} \frac{(N-k)(N+1)}{(k+2)}$
  • ${\rm Var}[N_2] = \frac{2}{(k+1)} \frac{(N-k)(N+1)}{(k+2)}$
  • ${\rm Var}[N_3] =  \frac{1}{k} \frac{(N-k)(N+1)}{(k+2)}$
while the standard deviations, which have more recognizable scales (i.e. number of tanks instead of squared number of tanks), are the square roots of the respective variances. Let's see how the estimators perform for various sample sizes:



We can see that $N_3$ has the lowest variance of the three for all sample sizes. In the legend, I have marked $N_3$ as "MVUE", which stands for minimum variance unbiased estimator. This means that $N_3$ actually has the lowest variance of any unbiased estimator for $N$, not just the three above.
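Following the route described in the note above, we can also check the $N_3$ formula exactly: compute ${\Bbb E}[m]$ and ${\Bbb E}[m^2]$ from the pmf of the sample max, use ${\rm Var}[N_3] = \frac{(k+1)^2}{k^2}{\rm Var}[m]$, and compare against $\frac{1}{k}\frac{(N-k)(N+1)}{(k+2)}$. A minimal sketch, with illustrative parameter values:

```cpp
#include <iostream>

// Binomial coefficient as a double (adequate for these magnitudes).
double binom(int n, int r) {
    if (r < 0 || r > n) return 0.0;
    double result = 1.0;
    for (int i = 0; i < r; ++i) result = result * (n - i) / (i + 1);
    return result;
}

// Exact check of Var[N3] = ((k+1)/k)^2 * Var[m] against the closed form
// (N-k)(N+1) / (k(k+2)), using the pmf P[m = j] = C(j-1,k-1)/C(N,k).
int main() {
    const int N = 250;
    for (int k : {2, 5, 10, 20}) {
        double e_m = 0.0, e_m2 = 0.0;
        for (int j = k; j <= N; ++j) {
            double p = binom(j - 1, k - 1) / binom(N, k);
            e_m  += j * p;
            e_m2 += static_cast<double>(j) * j * p;
        }
        double var_m = e_m2 - e_m * e_m;
        double ratio = (k + 1.0) / k;
        double var_n3 = ratio * ratio * var_m;
        double formula = static_cast<double>(N - k) * (N + 1) / (k * (k + 2.0));
        std::cout << "k = " << k << ": Var[N3] = " << var_n3
                  << ", formula = " << formula << "\n";
    }
    return 0;
}
```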

In Part 3 of this post, I will introduce the mathematical heavy machinery (sorry for the pun) to prove that $N_3$ is indeed the MVUE.

Thanks for reading.


Sources


Wikipedia: German Tank Problem
Wikipedia: Discrete uniform distribution
Estimating the Size of a Population (Roger Johnson) - a paper introducing the unbiased estimators above.
An Empirical Approach to Economic Intelligence in World War II (Richard Ruggles and Henry Brodie, 1947) - the original paper describing the use of these statistical methods during WWII.

Parameter Estimation - Part 1 (Reader Request)

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 2, Proof of the Law of Large Numbers

The preliminary post How do mathematicians model randomness? introduced random variables, their probability distributions, and parameters thereof (namely, mean, variance, and standard deviation). This post, the response to a reader request from Anonymous, will cover estimation of parameters based on random sampling. I will explain the difference between parameters and statistics, introduce the concept of estimator bias, and address the reader request's specific question about unbiased estimation of standard deviation.

In Part 2 of this post, I will present a well known historical application of parameter estimation, the German Tank Problem, and compare methods of estimating an unknown population size. Finally, in Part 3, I will introduce complete and sufficient statistics, which allow us to prove that the best estimator among the candidates in Part 2 is the unique minimum variance unbiased estimator.


Parameters and statistics


Recall from the first preliminary post that we model random phenomena with random variables and their probability distributions. Key characteristics of these distributions (mean, variance, standard deviation, etc.) are called parameters.

Parameters are defined based on all possible values of a random variable (the population) weighted by their relative frequencies of occurrence (i.e. their probabilities). For example, the mean (denoted $\mu_{X}$, ${\Bbb E}(X)$, or simply $\mu$ when there is no ambiguity as to the underlying random variable) of an outcome of interest (i.e. random variable) $X$ is defined as $$
\mu = \sum_{i}{x_i {\Bbb P}(x_i)}
$$ where each $x_i$ is one of the possible values of $X$ and the sum runs over all possible values (in the continuous case, the sum would be replaced by an integral).

In practice, we do not have access to all possible values of $X$ and their probabilities. Instead, we typically have a sample of observed values, $x_1, x_2, \dotsc, x_n$. Given such a sample, we could estimate $\mu$ using the sample mean $$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}
$$ The sample mean is a statistic, a value calculated based on sample data, which can be used to estimate the (unknown) parameter value. Notice that $\bar{x}$ can take on different values depending on which random sample we used to calculate it. In other words, $\bar{x}$ is itself a random variable.


Estimator bias, Bessel's correction


As random variables, statistics have their own probability distributions, known as sampling distributions, and thus their own means, variances, standard deviations, etc. We actually already touched upon this fact in the earlier posts Monte Carlo Simulation - Part 2 and Proof of the Law of Large Numbers, in which we proved that the sample mean $\bar{x}$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$ (a fact that we will use below).

Since ${\Bbb E}(\bar{x}) = \mu$, we say that the sample mean is an unbiased estimator of the population mean $\mu$. By the same logic, we may estimate the population variance $\sigma^2 = \sum_{i}{(x_i-\mu)^{2}{\Bbb P}(x_i)}$ using the statistic $$
s^2_1= \frac{1}{n} \sum_{i=1}^{n}{(x_i-\bar{x})^2}
$$ However, this statistic is not an unbiased estimator of $\sigma^2$. In order to see why this is the case, we can compute the expected value ${\Bbb E}(\sigma^2 - s_1^2)$, which would be zero if $s_1^2$ were unbiased (in the first step below, we use the fact that $\sigma^2 = {\Bbb E}\left[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2} \right]$): $$
\begin{align}
{\Bbb E} \left[ \sigma^2 - s_1^2 \right]

&= {\Bbb E} \left[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2} - \frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2}\right] \\[2mm]

&= \frac{1}{n}{\Bbb E}\left[
\sum_{i=1}^{n}{\left(
(x_i^2 - 2 x_i \mu + \mu^2) - (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)
\right)} \right] \\[2mm]

&= {\Bbb E}\left[
\mu^2 - \bar{x}^2 + \frac{1}{n}\sum_{i=1}^{n}{(2x_i (\bar{x}-\mu))}
\right] \\[2mm]

&= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + 2\bar{x}(\bar{x}-\mu) \right] \\[2mm]

&= {\Bbb E}\left[ \mu^2 - 2\bar{x}\mu + \bar{x}^2 \right] \\[2mm]

&= {\Bbb E} \left[ (\bar{x} - \mu)^2 \right] \\[2mm]

&= {\rm Var}(\bar{x}) \\[2mm]

&= \frac{\sigma^2}{n}

\end{align}
$$ Since ${\Bbb E}[\sigma^2 - s_1^2] = {\Bbb E}[\sigma^2] - {\Bbb E}[s_1^2] = \sigma^2 - {\Bbb E}[s_1^2]$, the above implies that $$
{\Bbb E}[s_1^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2
$$ Therefore, the statistic $s^2 = \frac{n}{n-1}s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2}$, known as the sample variance, has expected value $\sigma^2$ and is thus an unbiased estimator of the population variance.

The replacement of $\frac{1}{n}$ with $\frac{1}{n-1}$ in the sample variance formula is known as Bessel's correction. The derivation above shows that the bias in $s_1^2$ arises due to the fact that $(x_i-\bar{x})$ underestimates the actual quantity of interest, $(x_i-\mu)$, by $(\bar{x}-\mu)$ for each $x_i$. Therefore, the bias is the variance of $\bar{x}$, which we proved to be $\frac{\sigma^2}{n}$ in Proof of the Law of Large Numbers. Using $s^2$ instead of $s_1^2$ corrects for this bias.
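A quick simulation sketch makes the $\frac{n-1}{n}$ factor visible. The Normal population and the particular $\mu$, $\sigma$, and $n$ below are illustrative assumptions; Bessel's correction itself does not depend on the shape of the distribution.

```cpp
#include <iostream>
#include <random>

// Simulate many samples of size n from a Normal(mu, sigma) population and
// compare the averages of s1^2 (divide by n) and s^2 (divide by n-1) against
// sigma^2. Expect E[s1^2] ~ (n-1)/n * sigma^2 and E[s^2] ~ sigma^2.
int main() {
    const int n = 5, trials = 500000;
    const double mu = 10.0, sigma = 3.0;  // illustrative parameters
    std::mt19937 rng(123);
    std::normal_distribution<double> pop(mu, sigma);

    double sum_s1 = 0.0, sum_s2 = 0.0;
    for (int t = 0; t < trials; ++t) {
        double x[n], xbar = 0.0;
        for (int i = 0; i < n; ++i) { x[i] = pop(rng); xbar += x[i] / n; }
        double ss = 0.0;
        for (int i = 0; i < n; ++i) ss += (x[i] - xbar) * (x[i] - xbar);
        sum_s1 += ss / n;
        sum_s2 += ss / (n - 1);
    }
    std::cout << "mean s1^2 = " << sum_s1 / trials
              << " (theory " << (n - 1.0) / n * sigma * sigma << ")\n"
              << "mean s^2  = " << sum_s2 / trials
              << " (theory " << sigma * sigma << ")\n";
    return 0;
}
```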


Estimation of the standard deviation


Given $s^2$ is an unbiased estimator of $\sigma^2$, we may expect that the sample standard deviation $s=\sqrt{s^2}$ would also be an unbiased estimator of the population standard deviation $\sigma$. However, $$
\begin{align}
{\Bbb E}\left[ s^2 \right] &= \sigma^2 \\[2mm]
\Rightarrow {\Bbb E}\left[ \sqrt{s^2} \right] &< \sqrt{{\Bbb E}\left[ s^2 \right]} = \sqrt{\sigma^2} = \sigma
\end{align}
$$ where the inequality follows from Jensen's inequality and the fact that the square root is a strictly concave function (its graph curves downward; equivalently, the region below its graph is convex). In other words, $s$ underestimates $\sigma$ on average.

Unfortunately, for estimating the population standard deviation, there is no easy correction as there is for the variance. The size of the necessary correction depends on the distribution of the underlying random variable. For the Normal distribution, there is a complicated exact formula, but simply replacing the $n-1$ in the denominator with $n-1.5$ eliminates most of the bias (with the remaining bias decreasing with increasing sample size). A further adjustment is possible for other distributions and depends on the excess kurtosis, a measure of the "heavy-tailedness" of the distribution in excess of that of the Normal distribution.

While the specific corrections are beyond the scope of this post, for the brave, there is an entire Wikipedia article dedicated to exactly this topic.
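For the curious, here is a minimal simulation sketch comparing the usual $n-1$ denominator against the $n-1.5$ variant mentioned above, for Normal samples (the parameter values are illustrative):

```cpp
#include <cmath>
#include <iostream>
#include <random>

// For Normal samples of size n, estimate E[s] using the usual n-1
// denominator and the bias-reducing n-1.5 variant. The first average falls
// noticeably below sigma; the second lands much closer.
int main() {
    const int n = 5, trials = 500000;
    const double mu = 0.0, sigma = 2.0;  // illustrative parameters
    std::mt19937 rng(321);
    std::normal_distribution<double> pop(mu, sigma);

    double sum_s = 0.0, sum_s_adj = 0.0;
    for (int t = 0; t < trials; ++t) {
        double x[n], xbar = 0.0;
        for (int i = 0; i < n; ++i) { x[i] = pop(rng); xbar += x[i] / n; }
        double ss = 0.0;
        for (int i = 0; i < n; ++i) ss += (x[i] - xbar) * (x[i] - xbar);
        sum_s     += std::sqrt(ss / (n - 1));    // usual sample sd
        sum_s_adj += std::sqrt(ss / (n - 1.5));  // bias-reducing variant
    }
    std::cout << "mean s (n-1):   " << sum_s / trials << "\n"
              << "mean s (n-1.5): " << sum_s_adj / trials
              << "  (true sigma = " << sigma << ")\n";
    return 0;
}
```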


Other measures of estimator quality


Zero bias is certainly a desirable quality for a statistic to have, but an estimator's quality depends on more than just its expected value. A statistic's variance tells us how large of a spread (from its expected value) we may expect when calculating the statistic based on various samples. Just as a statistic with large bias is not particularly helpful, neither is one with no bias but a large variance.


The notion of consistency ties bias and variance together nicely: a consistent estimator is one which converges in probability to the population parameter. This means that, as $n \rightarrow \infty$, the probability of an error greater than some specified amount $\epsilon$ approaches zero. In particular, if both the bias and the variance of an estimator tend to zero as the sample size grows, then the estimator is consistent (this can be shown with Chebyshev's inequality, which appears in the preliminary post Proof of the Law of Large Numbers).

For example, the (weak) law of large numbers implies that $\bar{x}$ is a consistent estimator of $\mu$, as $\lim_{n \rightarrow \infty}{{\Bbb P} \left[ \left| \bar{x} - \mu \right| \geq \epsilon \right]} = 0$ for any $\epsilon > 0$. Furthermore, $s_1^2$ and $s^2$ are both consistent estimators of $\sigma^2$, while $s$ is a consistent estimator of $\sigma$. These examples show that both biased and unbiased estimators can be consistent.

That will do it for Part 1 of this post. Thanks for reading, and look out for Parts 2 and 3, coming up soon. Thanks to Anonymous for the great reader request.


Sources:


Wikipedia - Unbiased estimation of standard deviation
Wikipedia - Bessel's correction
Quora post - Estimator bias vs. variance

Proof of the Law of Large Numbers

Preliminaries: Monte Carlo Simulation - Part 2, How do mathematicians model randomness?

In Monte Carlo Simulation - Part 2, I introduced the Weak and Strong Laws of Large Numbers as well as the Central Limit Theorem. While the proof of the latter is a bit more technical, I wanted to include a rather intuitive proof of the (Weak) Law of Large Numbers.

The LLN essentially states that, given a sufficiently large sample size, a random sample mean is unlikely to fall far away from the long-run mean. Thus, it should come as no surprise that the standard proof consists of taking the limit as $n \rightarrow \infty$ of an inequality limiting the likely "spread" of a random variable from its mean.

The Law of Large Numbers states that, for large sample sizes,
sample means will likely converge to the theoretical mean.
(image courtesy of Wikipedia)


Below, I'll prove two random variable inequalities called Markov's Inequality and Chebyshev's Inequality and then show you how the LLN falls right out of the latter when we take a limit.

Markov's Inequality: If $X$ is a non-negative random variable, then the probability that $X$ is at least some positive value $a$ is no greater than the mean of $X$ divided by $a$. In symbols: $$
{\Bbb P}(X \geq a) \leq \frac{{\Bbb E}(X)}{a}
$$ Proof: Let $I$ be the indicator random variable for the event $X \geq a$. In other words, $I$ is a random variable which equals 1 if $X \geq a$ and 0 otherwise. Then it is always the case that $aI \leq X$:

  • If $X<a$, then $I=0$, so $aI=0 \leq X$, since $X$ is non-negative.
  • If $X \geq a$, then $I=1$, so $aI = a \cdot 1 = a \leq X$ since $X \geq a$.
Since $aI \leq X$ is always true, the means must satisfy the inequality ${\Bbb E}(aI) \leq {\Bbb E}(X) \ ( \spadesuit )$. But the left side of this inequality is equal to: 
$$
\begin{align}
{\Bbb E}(aI) &= a {\Bbb E}(I) \\
&= a(1\cdot {\Bbb P}(I=1) + 0 \cdot {\Bbb P}(I=0)) \\
&= a {\Bbb P}(I=1) \\
&= a {\Bbb P}(X \geq a) \tag{$\dagger$}
\end{align}
$$ Substituting $( \dagger )$ into $( \spadesuit )$ gives
$$ a {\Bbb P}(X \geq a) \leq {\Bbb E}(X) $$ Dividing both sides by $a$ yields the desired inequality.
$\square$

Chebyshev's Inequality: Let $X$ be a random variable with mean $\mu$ and non-zero, finite variance $\sigma^2$, and let $k \geq 1$. Then the probability is at most $\frac{1}{k^2}$ that $X$ falls at least $k$ standard deviations from its mean. In symbols: $$
{\Bbb P}( | X - \mu | \geq k \sigma ) \leq \frac{1}{k^2}
$$ Chebyshev's inequality puts a maximum on how frequently a random variable will exhibit values far away from its mean.

Proof: $|X - \mu| \geq k \sigma$ if and only if $(X-\mu)^2 \geq (k \sigma)^2$. Since $(X-\mu)^2$ is non-negative, we can apply Markov's inequality with $a = (k \sigma)^2$: $$
{\Bbb P}\left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{{\Bbb E}\left[ (X-\mu)^2 \right]}{(k \sigma)^2}
$$ But ${\Bbb E}\left[ (X-\mu)^2 \right]$ is the definition of the variance $\sigma^2$, so $$
{\Bbb P} \left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{\sigma^2}{(k \sigma)^2} = \frac{1}{k^2}
$$ as desired.
$\square$

Now, we can apply Chebyshev's inequality to the sample mean $\overline{X}_n$ to obtain the Weak Law of Large Numbers. For the below, let $X_1, X_2, X_3, \dotsc$ be iid random variables with mean $\mu$ and finite variance $\sigma^2$.

Proof of the (Weak) Law of Large Numbers: In Monte Carlo Simulation - Part 2, we saw that the expected value of the sample mean is the long-run mean, i.e. ${\Bbb E}(\overline{X}_n) = \mu$. Furthermore, the variance of $\overline{X}_n$ is $$
\begin{align}
{\rm var} \left( \overline{X}_n \right) &= {\rm var} \left(\frac{1}{n} \sum_{i=1}^{n}{X_i} \right) \\[2mm]

&= \frac{1}{n^2} \sum_{i=1}^{n}{{\rm var}(X_i)} \\[2mm]

&= \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}
\end{align}
$$ Now, let $\epsilon > 0$. Then $$
\begin{align}
k \sigma_{\overline{X}_n} = \epsilon
&\iff k^2 {\rm var} \left( \overline{X}_n \right) = \epsilon^2 \\[2mm]

&\iff \frac{1}{k^2} = \frac{{\rm var}\left( \overline{X}_n \right)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2}
\end{align}
$$ Thus, by Chebyshev's inequality, $$
{\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right)
\leq \frac{1}{k^2}
= \frac{\sigma^2}{n \epsilon^2}
$$ Taking the limit of both sides yields $$
\lim_{n \rightarrow \infty}{\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right)
\leq \lim_{n \rightarrow \infty}{\frac{\sigma^2}{n \epsilon^2}} = 0
$$ Finally, since probabilities are non-negative, the limit must be equal to zero.
$\square$
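To see the bound in action, the following sketch estimates ${\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right)$ by simulation for a fair coin (a Bernoulli(1/2) variable, so $\mu = 1/2$ and $\sigma^2 = 1/4$) and compares it with the Chebyshev bound $\frac{\sigma^2}{n \epsilon^2}$; the values of $n$, $\epsilon$, and the trial count are illustrative.

```cpp
#include <cmath>
#include <iostream>
#include <random>

// Estimate P[|Xbar_n - mu| >= eps] for iid fair-coin flips and compare with
// the Chebyshev bound sigma^2 / (n eps^2). Both shrink toward 0 as n grows,
// with the simulated probability sitting below the bound.
int main() {
    const double eps = 0.05, mu = 0.5, var = 0.25;  // Bernoulli(1/2)
    std::mt19937 rng(99);
    std::bernoulli_distribution coin(0.5);

    for (int n : {100, 400, 1600}) {
        const int trials = 20000;
        int exceed = 0;
        for (int t = 0; t < trials; ++t) {
            int heads = 0;
            for (int i = 0; i < n; ++i) heads += coin(rng);
            if (std::abs(static_cast<double>(heads) / n - mu) >= eps) ++exceed;
        }
        std::cout << "n = " << n << ": simulated "
                  << static_cast<double>(exceed) / trials
                  << ", Chebyshev bound " << var / (n * eps * eps) << "\n";
    }
    return 0;
}
```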

I hope this post helped clarify what the Law of Large Numbers says and why it's true. Thanks for reading, and feel free to ask any questions in the Comments section.

Monte Carlo Simulation - Part 2

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 1

In this post, a continuation of the reader-requested Monte Carlo Simulation - Part 1, I will present another application of Monte Carlo methods and a concrete example (the promised $\pi$ approximation), including the C++ source code so that you can see the "guts" of a Monte Carlo simulation.

I will also explain how the $\pi$ example relates to the Law of Large Numbers and Central Limit Theorem, key probabilistic results which apply beyond just Monte Carlo methods.


Monte Carlo approximation of definite integrals (and $\pi$)


In Part 1, I presented a retirement planning asset allocation problem as an example of when one may employ a Monte Carlo method. Given the enormous number of possible scenarios under the many statistical assumptions, this problem was a prime Monte Carlo candidate.

I will now show you another use of Monte Carlo methods: estimating the value of a definite integral $\int_{a}^{b}{f(x) \, dx}$, i.e. the area under a curve $f(x)$ between $x=a$ and $x=b$, when we cannot calculate it analytically (via the antiderivative). This is especially useful for multidimensional integrals, but I will stick to single-dimensional integrals here just to illustrate the method.

For example, suppose we would like to estimate the value of the definite integral $I=\int_{0}^{\pi}{\frac{\sin(x)}{x} \, dx}$. We have a computer choose many random sample points (from the uniform distribution on the rectangle $\left[ 0, \pi \right] \times \left[ 0,1 \right]$) and count how many fall below the curve (the green points in the below image, courtesy of Google):


In the image above, the computer has chosen 314,028 points so far, of which 185,208 were below the curve and 128,820 were above. We thus estimate the value of $I$ as $\frac{\text{185,208}}{\text{185,208}+\text{128,820}} \cdot \pi \approx 1.852854$. This is the percentage of sample points in the rectangle which fell below the curve, multiplied by the total area of the sample rectangle ($1 \cdot \pi$).

To approximate the value of $\pi$, we can use the same technique: we draw a circle of radius 1 centered at $(0,0)$ and choose $N$ random sample points $(x,y)$ from the uniform distribution on the square $\left[ -1, 1 \right] \times \left[ -1,1 \right]$. The total area of the square is 4, and the area of the circle is $\pi \cdot 1^{2} = \pi$. Thus, we would expect the percentage of sample points falling within the circle (the red points in the below animation) to be approximately $\frac{\pi}{4}$.

After running our program, we can count the number $N_1$ of sample points which fell within the circle (a point falls within the circle if its distance from the origin is less than 1, i.e. if $\sqrt{x^2 + y^2}<1$). Our $\pi$ approximation is then $\pi \approx 4 \tfrac{N_1}{N}$.

Animation of the Monte Carlo method to estimate $\pi$
(image courtesy of Wikipedia)

The below chart shows three of my simulation runs using the above method, with 100,000 sample points per run ($N= \text{100,000}$). Based on the chart, the approximation seems to generally improve as $N$ increases; while this is not exactly surprising, the Law of Large Numbers, which I will explain in the next section, formalizes this notion. The Law of Large Numbers states that, with probability 1, approximations like this one will yield sample averages arbitrarily close to the true value as $N \rightarrow \infty$. In other words, I didn't just run my simulation 100 times and choose three "lucky" results to show you.

Digging a bit deeper, it appears that $N=\text{10,000}$ would suffice for an approximation within about 0.02-0.03 of the actual value, but if we want precision within 0.01, we need an $N$ somewhere in the neighborhood of 50-100,000 samples. The Central Limit Theorem, which I will also address later in this post, tells us exactly how large we need to make $N$ in order to obtain, with a given level of confidence (e.g. 95%), an approximation with a given level of precision (e.g. 0.01).
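As a preview of that calculation: each sample point is a Bernoulli trial with success probability $p = \pi/4$ (a fact derived in the Law of Large Numbers section below), so each $X_i$ has variance $p(1-p) \approx 0.169$, and the two-sided 95% Normal quantile is about $1.96$. A back-of-the-envelope version of the sample-size calculation is then $$
\begin{align}
{\rm sd}\left[ 4\overline{X}_N \right] &= 4\sqrt{\frac{p(1-p)}{N}} \approx \frac{1.64}{\sqrt{N}} \\[2mm]
1.96 \cdot \frac{1.64}{\sqrt{N}} \leq 0.01 &\iff N \geq \left( \frac{1.96 \times 1.64}{0.01} \right)^2 \approx 103{,}000
\end{align}
$$ which is consistent with the 50-100,000 neighborhood observed in the chart.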


I have included the C++ code for my simulation at the end of the post so that interested readers can see the details.
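In the meantime, for readers who want to experiment immediately, here is a condensed, self-contained sketch of the method described above (a minimal version, not the full program behind the chart):

```cpp
#include <iostream>
#include <random>

// Monte Carlo estimate of pi: sample N points uniformly from the square
// [-1,1] x [-1,1] and count the fraction landing inside the unit circle.
// That fraction estimates pi/4, so 4 * N1/N estimates pi.
int main() {
    const int N = 100000;  // sample points, as in the chart above
    std::mt19937 rng(2718);
    std::uniform_real_distribution<double> coord(-1.0, 1.0);

    int n1 = 0;  // points inside the circle
    for (int i = 0; i < N; ++i) {
        double x = coord(rng), y = coord(rng);
        if (x * x + y * y < 1.0) ++n1;  // inside iff sqrt(x^2+y^2) < 1
    }
    std::cout << "pi estimate: " << 4.0 * n1 / N << "\n";
    return 0;
}
```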


The Law of Large Numbers


The Law of Large Numbers formalizes the notion that sample averages converge to the long-run average. There are actually two versions, known as the Weak Law and Strong Law of Large Numbers, which correspond to different types of random variable convergence. Let's take a look at each and interpret them in the context of the $\pi$ simulation.

For the below definitions, let $X_1, X_2, X_3, \dotsc$ be a sequence of random variables, independent from each other, which all have the same probability distribution. These are called independent and identically distributed (iid for short). Since the distributions are identical, the mean of each $X_i$ is the same value, $\mu$, which we will assume is finite. The symbol $\overline{X}_n$ will denote the sample average, defined as $\frac{1}{n}\sum_{i=1}^{n}{X_i}$.

The Weak Law of Large Numbers states that, as $n \rightarrow \infty$, the sample average converges in probability to $\mu$, which means that, for any threshold $\epsilon > 0$ (no matter how small), we have $$
\lim_{n \rightarrow \infty} {\Bbb P} \left( \left| \overline{X}_n - \mu \right| > \epsilon \right) = 0
$$ In other words, for any $\epsilon$ specified (e.g. one thousandth, one millionth, etc.), the probability that the sample average will be further than $\epsilon$ from $\mu$ can be made arbitrarily small (but not necessarily zero) by choosing a large enough sample size $n$.

The Strong Law of Large Numbers states that the sample means converge almost surely to $\mu$ (almost surely means "with probability 1"), or in symbols, $$
{\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} = \mu \right) = 1
$$ This is equivalent to the statement $$
{\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} \neq \mu \right) = 0
$$ (where the event in parentheses includes the case that the limit fails to exist). In other words, the probability of a sample outcome for which there is some $\epsilon > 0$ such that no $m$ is large enough to keep $\overline{X}_n$ within $\epsilon$ of $\mu$ whenever $n>m$ is equal to zero. This is a stronger statement than the Weak Law in that it implies the Weak Law.

For the $\pi$ simulation, a single $X_i$ is the outcome of the experiment of choosing a random sample point within the square $\left[ -1, 1 \right] \times \left[ -1, 1 \right]$ and assigning $X_i = 1$ if the point is within the circle of radius 1 or $X_i = 0$ if not. Thus, the expected value of each $X_i$ is $\mu = \pi / 4$, since $$
\begin{align}
{\Bbb E}(X_i) &= 0 \cdot {\Bbb P}(X_i = 0) + 1 \cdot {\Bbb P}(X_i = 1) \\
&= 0 \cdot (1 - \pi / 4) + 1 \cdot (\pi / 4) \\
&= \pi / 4
\end{align}
$$ The Run 1, 2, and 3 lines in the chart above represent the random variable $4 \overline{X}_n$ (with $n$ on the horizontal axis), and $4X_i$ has expected value $\pi$, the dotted red line. Both Laws of Large Numbers essentially tell us that it is unlikely for the simulated values to be too far from $\pi$ when $n$ grows large enough.

More precisely, take $\epsilon = 0.0001$ as an example error threshold and 99.9% as an example probability threshold. The Weak Law tells us that there exists some $m$ large enough that, if we ran a new Run 4 simulation, then at any single fixed $n > m$, there would be a probability of at least 99.9% that the Run 4 line would sit within 0.0001 above or below the dotted red line at that $n$. Note that the Weak Law alone does not guarantee this at all $n > m$ simultaneously.

The Strong Law tells us more: with probability exactly 100%, there is some large enough $m$ such that the Run 4 line remains within the above-mentioned error band everywhere to the right of $n=m$. The Strong Law implies the Weak Law, and while there is a subtle difference in interpretation, they both essentially say that, in some sense, $\overline{X}_n \rightarrow \mu$ as $n \rightarrow \infty$.
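To see this convergence concretely without running the full program at the end of the post, here is a minimal, self-contained C++ sketch (my own illustrative code; the checkpoint values of $n$ are arbitrary choices) that prints the running estimate $4\overline{X}_n$ and its error at a few values of $n$:

#include<iostream>
#include<random>
#include<cmath>

int main()
{
    const double pi = 3.14159265358979323846; //reference value for the error column
    std::mt19937 gen(std::random_device{}()); //seeded Mersenne Twister generator
    std::uniform_real_distribution<double> urd(-1.0, 1.0);

    long long inside = 0; //running count of points inside the unit circle
    for (long long n = 1; n <= 1000000; n++) {
        double x = urd(gen), y = urd(gen);
        if (x * x + y * y < 1.0) { inside++; }

        //print the running estimate at n = 10, 100, ..., 1,000,000
        if (n == 10 || n == 100 || n == 1000 || n == 10000
            || n == 100000 || n == 1000000) {
            double est = 4.0 * inside / static_cast<double>(n);
            std::cout << "n = " << n << ", estimate = " << est
                      << ", error = " << std::fabs(est - pi) << std::endl;
        }
    }
    return 0;
}

On a typical run, the error column shrinks roughly like $1/\sqrt{n}$, which is exactly the rate the Central Limit Theorem (see the later section) predicts.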

Note: for a proof of the Law of Large Numbers using inequalities about the likely "spread" of a random variable around its mean, see this post.

What the Law of Large Numbers does not tell us is how large this "$m$" value needs to be. This is where the Central Limit Theorem comes in.


The Central Limit Theorem


We know that the expected value of a sample mean $\overline{X}_n$ is the population mean $\mu$, since $$
{\Bbb E}(\overline{X}_n)
= {\Bbb E}\left( \frac{\sum_{i=1}^{n}{X_i}}{n} \right)
= \frac{\sum_{i=1}^{n}{{\Bbb E}(X_i)}}{n}
= \frac{n \mu}{n} = \mu
$$ The Law of Large Numbers tells us that, with very high probability, $\overline{X}_n$ will indeed be close to its expected value for large enough values of $n$. The Central Limit Theorem goes a step further and gives us the (approximate) probability distribution of $\overline{X}_n$ itself, known as the sampling distribution of the mean.

Central Limit Theorem: If $X_1, X_2, X_3, \dotsc$ are iid with mean $\mu$ and finite variance $\sigma^2$, then, as $n \rightarrow \infty$, the standardized random variable $\frac{\sqrt{n} \left( \overline{X}_n - \mu \right)}{\sigma}$ converges in distribution to a standard Normal random variable. Informally: for large $n$, the distribution of $\overline{X}_n$ is approximately Normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

In this context, convergence in distribution means that the cumulative distribution function of the standardized sample mean converges pointwise (at every continuity point) to the standard Normal CDF.

What's amazing about the Central Limit Theorem is that the limiting distribution of $\overline{X}_n$ is Normal regardless of the distribution of the $X_i$'s themselves. In the $\pi$ example, the distribution of $X_i$ is obviously not Normal: it's the discrete distribution $$
p(x) = \begin{cases} 1 - \pi / 4 & \text{if $x=0$} \\ \pi / 4 & \text{if $x=1$}
\end{cases}
$$ Nonetheless, the CLT tells us that the sampling distribution of $\overline{X}_n$ will still be approximately Normal for large enough values of $n$. In practice, the rule of thumb is that the Normal approximation is valid when $n$ is at least 30; this "For Dummies" article provides a bit more color on this topic. The below image from Wikipedia shows the sampling distribution of the mean for differing sizes of $n$, using the example of coin flips (the $X_i$'s here are the same as the $\pi$ approximation except that they have "success" probability $1/2$ instead of $\pi/4$):

Sampling distribution of the mean $\frac{1}{n}\sum_{i=1}^{n}{X_i}$ for various values of $n$, where each $X_i$ is 1 for heads and 0 for tails. Notice that the CLT's Normal approximation seems to be most accurate starting around $n=30$.
(image courtesy of Wikipedia)
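As an empirical sanity check of the CLT in our setting (again my own sketch, separate from the main program), we can generate many independent sample means $\overline{X}_n$ and count how often they land inside the Normal-approximation band $\mu \pm 1.96 \sigma / \sqrt{n}$; if the approximation is good, the answer should be close to 95%:

#include<iostream>
#include<random>
#include<cmath>

int main()
{
    const double pi = 3.14159265358979323846;
    const double p = pi / 4.0;                   //P(X_i = 1) in the pi simulation
    const double sigma = std::sqrt(p * (1 - p)); //standard deviation of one X_i
    const int n = 1000;     //sample size behind each sample mean
    const int reps = 10000; //number of independent sample means to generate

    std::mt19937 gen(std::random_device{}());
    std::bernoulli_distribution point_in_circle(p);

    int covered = 0;
    for (int r = 0; r < reps; r++) {
        int sum = 0;
        for (int i = 0; i < n; i++) { sum += point_in_circle(gen); }
        double xbar = static_cast<double>(sum) / n;
        //the CLT predicts |xbar - p| <= 1.96*sigma/sqrt(n) about 95% of the time
        if (std::fabs(xbar - p) <= 1.96 * sigma / std::sqrt(n)) { covered++; }
    }
    std::cout << "fraction inside the 95% band: "
              << static_cast<double>(covered) / reps << std::endl;
    return 0;
}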

We can use the CLT-provided Normal approximation to determine how many simulations we need in order to have at least, say, 95% probability of obtaining a $\pi$ estimate within, say, 0.01 of the true value of 3.1415[...]. In order to do this, we need to find the variance of the random variable $4 X_i$. This is given by the computation $$
\begin{align}
\sigma^2_{4 X_i} &= 4^2 \cdot \sigma^2_{X_i} = 16 \cdot {\Bbb E}\left[ (X_i - \mu_{X_i})^2 \right] \\[2mm]

&= 16 \bigg[
\underbrace{\left( 1 - \frac{\pi}{4} \right)}_{p(0)} \left(0 - \frac{\pi}{4} \right)^2
+ \underbrace{\frac{\pi}{4}}_{p(1)} \left( 1 - \frac{\pi}{4} \right)^2
\bigg] \\[2mm]

&= 16 \left[
\left(\frac{\pi}{4} \right)^2 - \left( \frac{\pi}{4} \right)^3 + \frac{\pi}{4} - 2 \left( \frac{\pi}{4} \right)^2 + \left( \frac{\pi}{4} \right)^3
\right] \\[2mm]

&= 16 \left[
\frac{\pi}{4} - \left( \frac{\pi}{4} \right)^2
\right] \\[2mm]

&= 4 \pi \left[ 1 - \frac{\pi}{4} \right] \\[2mm]

&= 4 \pi - \pi^2
\end{align}
$$ The below Excel screenshot shows the math needed to obtain a suitable value of $N$, which turns out to be just over the value of 100,000 which I used for my simulations.
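In case the screenshot doesn't render, the calculation boils down to requiring that 1.96 standard deviations of the estimator $4\overline{X}_N$ (the two-sided 95% Normal band) fit within the desired precision of 0.01: $$
1.96 \cdot \frac{\sqrt{4\pi - \pi^2}}{\sqrt{N}} \leq 0.01
\quad \Longrightarrow \quad
N \geq \left( \frac{1.96 \sqrt{4\pi - \pi^2}}{0.01} \right)^2 \approx \text{103,600}
$$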




$\pi$ approximation source code (C++)


Below is the full source code for my program, as well as the result and the first few lines of the output file "Program Output.txt". Underneath, I have added some commentary about what the different sections do.


#include<iostream>
#include<fstream>
#include<random>
#include<vector>
#include<cmath>
#include<chrono>
using namespace std;

class point //represents a point (x,y) in the plane
{
private:
    double x;
    double y;
public:
    point(double _x, double _y): x(_x), y(_y) {} //constructor
    void assign_vals(double _x, double _y) {x = _x; y = _y;}
    double len(void){ //distance from the point to the origin
        return sqrt(x*x + y*y); //equivalent to pow(x^2+y^2, 0.5), but clearer
        }
};

// obtain a seed from the system clock:
unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
static std::mt19937 gen(seed);
double rand_num(){
    static std::uniform_real_distribution<double> urd(-1.0,1.0); //uniform random real number between -1 and 1
    return urd(gen); //use gen as a generator
}

int main()
{
    //drop N random points into the square(-1,1)x(-1,1)
    //check if they are within circle of radius 1
    //circle area = pi*(1^2) = pi
    //square area = 2*2 = 4
    //pi estimate = 4 * (# pts in circle / # pts total)
    const int N = 100000; //number of sims
    vector<bool> results(N,0); //vector of N 0's to store intermediate results
    vector<double> pi_approx_vect(N,0); //vector of N 0's to store approximation results
    double x, y, r;
    int sum=0; //sum counter dummy variable
    double pi_approx;
    point * p = new point(0,0); //pointer to point

    ofstream myfile ;
    myfile.open("Program Output.txt", ios::trunc);

    for(size_t j=0; j<results.size(); j++){ //0-based indexing avoids reading past the end of the vectors
        x = rand_num(); //random number btw -1.0 and 1.0
        y = rand_num(); //'''
        p->assign_vals(x,y);
        r = p->len();

        results[j] = (r < 1); //1 if the point fell inside the circle, 0 otherwise

        sum = sum + results[j];

        pi_approx_vect[j] = 4 * sum / static_cast<double>(j+1); //running approximation after j+1 points

        if (myfile.is_open())
        {
            myfile << results[j] << endl;
        }

    }

    delete p;
    pi_approx = 4 * sum / static_cast<double>(N);

    cout << "N: " << N << endl;
    cout << "pi approximation: " << pi_approx;

    myfile.close();

    return 0;
}

C++ program output

"Program Output.txt" output

My notes on the program:

  • The #include directives at the top pull in the C++ standard libraries used for console and file I/O, random number generation, vectors, math functions, and clock access.
  • The "point" class: in the "main" program, the class is instantiated to create a new "point" object $p$, which has members called $x$ and $y$. Since these members are declared using the "private" keyword in the class definition, the "main" program cannot access them directly. Instead, it populates their values using the constructor (e.g. when $p$ is initialized with $(0,0)$ in its declaration) or the "assign_vals" function (inside the main loop). The "len" function returns the "radius", $\sqrt{x^2+y^2}$, of the point.
  • The random number setup "seeds" the generator using the current clock time, which ensures we will get different random numbers each time we run the program. The "rand_num()" function generates random numbers from the uniform distribution on $[-1,1]$; calling it twice yields a random point in the square $[-1,1] \times [-1,1]$.
  • The "main" program declares its working variables, then loops over $j = 0, 1, \dots, N-1$: it populates the point $p$ with random values of $x$ and $y$, and sets the $j$-th entry of the "results" vector to 1 if the radius $r<1$ and to 0 if $r \geq 1$.
  • After the loop, the $\pi$ approximation is calculated as 4 times the sum of the "results" vector (i.e. the number of "successes") divided by $N$. Since both $N$ and "sum" are integer variables, "static_cast<double>" is included to ensure that the quotient is calculated as a decimal number rather than being truncated to an integer.
Any questions or constructive criticism on the code are more than welcome in the Comments section; I am fairly new to C++, so some of the above may be unconventional. Note that it is certainly possible to write the same simulation without using classes, but I did it this way to experiment a bit as a learning exercise for myself.

That will do it for this post. Thanks for reading, and thanks to "Anonymous" for the great Reader Request.

Monte Carlo Simulation - Part 1

Preliminary: How do mathematicians model randomness?

In this two-part post, I will introduce Monte Carlo methods, an umbrella term for techniques involving many random simulations of a phenomenon of interest to obtain a numerical result (or range of results). This post is a response (albeit a late one; my apologies for that) to a reader request from Anonymous dated August 2016.

In Part 1, I will explain at a high level what a Monte Carlo simulation is, what kinds of typical inputs and outputs we may expect, and the benefits and limitations of Monte Carlo methods. I use the example of retirement/investment planning software, which Anonymous also mentioned in his request.

In Part 2, I will walk through a considerably simpler example in detail: a Monte Carlo method to approximate the value of $\pi$. I will also explain two well known statistical results, the Law of Large Numbers and the Central Limit Theorem, which justify the use of Monte Carlo methods and allow us to quantify their accuracy. The $\pi$ simulation will directly illustrate the LLN and CLT.


Monte Carlo methods - high-level overview


In the most general terms, a Monte Carlo simulation is a method whereby we simulate one or many random inputs many times (usually thousands or millions), making assumptions as to their probability distributions, and combine the results in a deterministic (i.e. non-random) way to obtain a numerical output. In this section, I will explain in layman's terms why and how we may do this as well as some limitations of these methods.

When to employ a Monte Carlo method

In the reader request, Anonymous mentioned a typical situation in which an analyst may employ a Monte Carlo method: a wealth manager wishes to build a portfolio of various assets for a client with the goal of providing sufficient income for the client after his retirement. More precisely, for a prospective portfolio to be considered acceptable, it must have a sufficiently high probability of providing enough cash to cover the client's estimated living costs at all times after his retirement date.

Suppose we would like to assess the viability of a proposed portfolio consisting of a number of equities (stocks) and fixed-income assets (bonds). We will need to make static assumptions about the client's age, retirement age, initial portfolio size, and post-retirement income and living costs. We will also need to make certain assumptions regarding the probability distributions of interest rates, equity returns, debt yields, dividend rates, inflation, mortality, etc. Note that we may re-categorize some of the "static" assumptions as variable and vice-versa, depending on the goals of the analysis.

In any case, with so many variable inputs, it is no simple task to determine the probability that the proposed portfolio will be acceptable at any given time, let alone at all times. The number of random inputs could be enormous, and we do not necessarily have a tractable way of combining them all to arrive at a probability distribution for the portfolio value at some time $t$, so we can't solve the problem analytically. However, we do have clear assumptions for all the inputs, and we have computers.

Instead of an analytical solution, we can use a Monte Carlo simulation to arrive at a numerical solution.

Typical inputs, outputs, and interpretation

I mentioned before that a Monte Carlo simulation consists of many "runs" of a deterministic procedure based on randomized inputs.

To perform a single run of a Monte Carlo simulation, we program a computer to simulate each input by drawing random numbers from an assumed probability distribution. In the retirement example, we would randomly generate a sample path over time for each relevant interest rate, stock price, bond price, etc. Given these paths, we can (deterministically!) compute the value of the portfolio at each time, and then it is a matter of arithmetic to determine whether the portfolio covered the living costs at each time or failed. This run represents one of the many possible scenarios which could occur.

We repeat this process for, say, 100,000 runs (in Part 2, I will elaborate further on how many runs we need to use), calculating the simulated probability of the portfolio's success as the number of successes divided by 100,000. If that probability exceeds a predetermined threshold (e.g. 90%), then we deem the portfolio "acceptable".
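To make the run-and-count structure concrete, here is a deliberately oversimplified C++ sketch. Everything in it is an illustrative assumption of mine (not taken from any actual retirement software): annual returns are drawn iid from a Normal distribution, a fixed withdrawal is taken each year, and a run "succeeds" if the portfolio never runs dry over the horizon:

#include<iostream>
#include<random>

int main()
{
    //illustrative assumptions only, not calibrated to any real product:
    const double initial    = 1000000.0; //starting portfolio value
    const double withdrawal = 50000.0;   //fixed annual withdrawal
    const double mean_ret   = 0.05;      //assumed mean annual return
    const double sd_ret     = 0.12;      //assumed annual return volatility
    const int    years      = 30;        //retirement horizon in years
    const int    runs       = 100000;    //number of Monte Carlo runs

    std::mt19937 gen(std::random_device{}());
    std::normal_distribution<double> annual_return(mean_ret, sd_ret);

    int successes = 0;
    for (int r = 0; r < runs; r++) {
        double value = initial;
        bool ok = true;
        for (int y = 0; y < years; y++) {
            value = value * (1.0 + annual_return(gen)) - withdrawal; //grow, then withdraw
            if (value < 0.0) { ok = false; break; }                  //portfolio ran dry
        }
        if (ok) { successes++; }
    }

    //simulated probability of success = successes / runs
    std::cout << "estimated success probability: "
              << static_cast<double>(successes) / runs << std::endl;
    return 0;
}

A real engine would randomize far more inputs (rate paths, inflation, mortality) and correlate them across time, but the count-the-successes skeleton is the same.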

Typical software would also provide more sophisticated chart outputs such as the below (which I conveniently found on Google). The first image shows median portfolio values at each time based on certain fixed input parameters, with the median taken over all the simulation runs.

The second image shows a "heat map" of success probabilities for different values of the input parameters. Each pixel on the heat map summarizes a full probability distribution (presumably of portfolio shortfall) for one fixed set of input parameter values: a green pixel indicates that the corresponding parameter values (the $x$- and $y$-axis values) lead to a portfolio which is likely to cover costs in every year after retirement, while a red pixel indicates parameter values leading to a portfolio which is more likely to fail, i.e. to fall short of costs (even counting previous years' excesses) in at least one year after retirement.



The software must make assumptions about the statistical properties of the portfolio in order to generate chart outputs like the above. With respect to the probabilistic inputs (these do not include the deterministic inputs such as retirement age, tax rates, initial portfolio size, etc.), the images show only a mean and standard deviation of the investment returns. Therefore, this software likely assumes Normally distributed returns.

It's also worth noting that, for example, Life Expectancy seems to be an input here with a value of 95. This suggests that the simulations may not be randomizing that input. In interpreting this output, it would be important to read the software documentation and understand the major assumptions employed.


Benefits and limitations of Monte Carlo methods

The most obvious benefit of a Monte Carlo simulation is that it allows us to run millions of complex scenarios in just seconds or minutes, a task which would be impossible without a computer. Another key benefit is a bit more subtle.

If we didn't have computers available, we might attempt to answer our retirement question using "what-if scenarios": we would choose a few sets of very simple assumptions about equity and debt returns which allow us to use "back-of-the-envelope" arithmetic to calculate the portfolio value. For example, we might assume fixed, non-random annual returns with three cases as follows:


We may even go a step further and assume probability distributions for these returns (e.g. Normal distributions) which allow us to arrive at an analytical solution for the portfolio value's probability distribution over time. This is certainly a reasonable approach; however, these what-if scenarios give us no indication of how likely the three different cases are. A Monte Carlo simulation, on the other hand, gives us a distribution of possible outcomes and the probabilities of those outcomes. Unlikely scenarios will not be over-represented among the simulation runs.
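As an aside on the analytical route just mentioned: under the simplifying assumptions (mine, for illustration) of iid Normal annual log-returns $r_i \sim \text{Normal}(\mu, \sigma^2)$ and no interim contributions or withdrawals, the portfolio value after $t$ years has a closed-form distribution: $$
V_t = V_0 \, e^{r_1 + r_2 + \dots + r_t}, \qquad \ln \frac{V_t}{V_0} \sim \text{Normal}(t \mu, \, t \sigma^2)
$$ so $V_t$ is lognormally distributed and its quantiles can be read directly from a Normal table, with no simulation required.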

Of course, no method is without limitations. On the practical side, without ready-made software, we typically need to write our own code to perform a Monte Carlo simulation; this takes time and effort that a simpler analysis might avoid. For very complex simulations, we may also need to worry about the integrity of the random number generator(s) and, in some cases, the speed of the code, though these concerns are beyond the scope of this post.

From a more qualitative standpoint, we have the "garbage in, garbage out" principle: a Monte Carlo simulation, like any analysis, is based on certain input assumptions. We must always keep in mind which assumptions went into a simulation when interpreting its results. These assumptions include both input probability distributions and the input parameter values used to calibrate those distributions.

For example, the chart outputs above probably assumed "normal" market conditions and Normally distributed returns. The above simulation may therefore underestimate the probability of portfolio failure due to a financial crisis (a non-normal and certainly non-Normal event). Similarly, if life expectancy was indeed a static input, then the analysis may underestimate the probability of failure for people who are very healthy and thus likely to live longer than average.

Finally, we must determine "reasonable" values of the input parameters. More often than not, we base these on historical data as our best guess. This implies that the simulation will weight possible scenarios based on their historical relative frequencies, while future frequencies may differ. On the other hand, assuming different parameter values based on our own informed estimates clearly introduces a different bias into the model.

In conclusion, Monte Carlo simulation is a powerful tool, but it is still just that: a tool. We must remain critical of our analysis and the assumptions that underlie it.

Stay tuned for Part 2, in which we will walk through a simple Monte Carlo method to estimate $\pi$ and introduce the Law of Large Numbers and Central Limit Theorem.