## Parameter Estimation - Part 1 (Reader Request)

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 2, Proof of the Law of Large Numbers


The preliminary post How do mathematicians model randomness? introduced random variables, their probability distributions, and parameters thereof (namely, mean, variance, and standard deviation). This post, the response to a reader request from Anonymous, will cover estimation of parameters based on random sampling. I will explain the difference between *parameters* and *statistics*, introduce the concept of estimator *bias*, and address the reader request's specific question about unbiased estimation of standard deviation.

In Part 2 of this post, I will present a well-known historical application of parameter estimation, the *German Tank Problem*, and compare methods of estimating an unknown population size. Finally, in Part 3, I will introduce *complete* and *sufficient* statistics, which allow us to prove that the best estimator among the candidates in Part 2 is the unique *minimum variance unbiased estimator*.

### Parameters and statistics

Recall from the first preliminary post that we model random phenomena with *random variables* and their *probability distributions*. Key characteristics of these distributions (mean, variance, standard deviation, etc.) are called **parameters**.

Parameters are defined based on all possible values of a random variable (the **population**) weighted by their relative frequencies of occurrence (i.e. their probabilities). For example, the mean (denoted $\mu_{X}$, ${\Bbb E}(X)$, or simply $\mu$ when there is no ambiguity as to the underlying random variable) of an outcome of interest (i.e. random variable) $X$ is defined as

$$
\mu = \sum_{i}{x_i {\Bbb P}(x_i)}
$$

where each $x_i$ is one of the possible values of $X$ and the sum runs over all possible values (in the continuous case, the sum would be replaced by an integral).
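As a small illustration of the definition above, here is a minimal sketch that computes $\mu$ for a fair six-sided die. The die, the helper name `population_mean`, and the variable names are assumptions chosen for the example, not something from the original post.

```python
from fractions import Fraction

# Hypothetical example: X is the outcome of one roll of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6

def population_mean(values, probabilities):
    """Return mu = sum_i x_i * P(x_i) for a discrete random variable."""
    return sum(x * p for x, p in zip(values, probabilities))

mu = population_mean(values, probabilities)
print(mu)  # 7/2
```

Exact fractions are used here only so that the result is not blurred by floating-point rounding.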

In practice, we do not have access to all possible values of $X$ and their probabilities. Instead, we typically have a **sample** of observed values, $x_1, x_2, \dotsc, x_n$. Given such a sample, we could estimate $\mu$ using the **sample mean**

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}
$$

The sample mean is a **statistic**, a value calculated based on sample data, which can be used to estimate the (unknown) parameter value. Notice that $\bar{x}$ can take on different values depending on which random sample we used to calculate it. In other words, $\bar{x}$ is itself a random variable.

### Estimator bias, Bessel's correction

As random variables, statistics have their own probability distributions, known as **sampling distributions**, and thus their own means, variances, standard deviations, etc. We actually already touched upon this fact in the earlier posts Monte Carlo Simulation - Part 2 and Proof of the Law of Large Numbers, in which we proved that the sample mean $\bar{x}$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$ (a fact that we will use below).

Since ${\Bbb E}(\bar{x}) = \mu$, we say that the sample mean is an **unbiased estimator** of the population mean $\mu$. By the same logic, we may estimate the population variance $\sigma^2 = \sum_{i}{(x_i-\mu)^{2}{\Bbb P}(x_i)}$ using the statistic

$$
s^2_1= \frac{1}{n} \sum_{i=1}^{n}{(x_i-\bar{x})^2}
$$

However, this statistic is *not* an unbiased estimator of $\sigma^2$. To see why, we can compute the expected value ${\Bbb E}(\sigma^2 - s_1^2)$, which would be zero if $s_1^2$ were unbiased:

$$\begin{align}

{\Bbb E} \left[ \sigma^2 - s_1^2 \right]
&= {\Bbb E} \left[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2} - \frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x})^2}\right] \\[2mm]
&= \frac{1}{n}{\Bbb E}\left[ \sum_{i=1}^{n}{\left( \left( x_i^2 - 2 x_i \mu + \mu^2 \right) - \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) \right)} \right] \\[2mm]
&= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + \frac{1}{n}\sum_{i=1}^{n}{2x_i (\bar{x}-\mu)} \right] \\[2mm]
&= {\Bbb E}\left[ \mu^2 - \bar{x}^2 + 2\bar{x}(\bar{x}-\mu) \right] \\[2mm]
&= {\Bbb E}\left[ \mu^2 - 2\bar{x}\mu + \bar{x}^2 \right] \\[2mm]
&= {\Bbb E} \left[ (\bar{x} - \mu)^2 \right] \\[2mm]
&= \mathrm{Var}(\bar{x}) \\[2mm]
&= \frac{\sigma^2}{n}

\end{align}

$$

Since ${\Bbb E}[\sigma^2 - s_1^2] = {\Bbb E}[\sigma^2] - {\Bbb E}[s_1^2] = \sigma^2 - {\Bbb E}[s_1^2]$, the above implies that

$$
{\Bbb E}[s_1^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2
$$

Therefore, the statistic $s^2 = \frac{n}{n-1}s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2}$, known as the **sample variance**, has expected value $\sigma^2$ and is thus an unbiased estimator of the population variance.

The replacement of $\frac{1}{n}$ with $\frac{1}{n-1}$ in the sample variance formula is known as **Bessel's correction**. The derivation above shows that the bias in $s_1^2$ arises because each deviation $(x_i-\bar{x})$ differs from the actual quantity of interest, $(x_i-\mu)$, by $(\bar{x}-\mu)$. Since $\bar{x}$ is computed from the same sample, the squared deviations from $\bar{x}$ are on average too small, and the size of the shortfall is exactly the variance of $\bar{x}$, which we proved to be $\frac{\sigma^2}{n}$ in Proof of the Law of Large Numbers. Using $s^2$ instead of $s_1^2$ corrects for this bias.

### Estimation of the standard deviation

Given that $s^2$ is an unbiased estimator of $\sigma^2$, we might expect that the **sample standard deviation** $s=\sqrt{s^2}$ would also be an unbiased estimator of the population standard deviation $\sigma$. However,

$$\begin{align}
{\Bbb E}\left[ s^2 \right] &= \sigma^2 \\[2mm]
\Rightarrow {\Bbb E}\left[ \sqrt{s^2} \right] &< \sqrt{{\Bbb E}\left[ s^2 \right]} = \sqrt{\sigma^2} = \sigma
\end{align}
$$

where the inequality follows from Jensen's inequality and the fact that the square root is a concave function (the region below its graph is a convex set). The inequality is strict because the square root is strictly concave, provided $s^2$ is not constant. In other words, $s$ underestimates $\sigma$ on average.
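The three results so far (the bias of $s_1^2$, the unbiasedness of $s^2$, and the downward bias of $s$) can be checked numerically. The following is a minimal Monte Carlo sketch, assuming Normal data with $\mu = 0$ and $\sigma = 2$ and samples of size $n = 5$. Those values, the function name `simulate_estimators`, and the seed are arbitrary choices made for illustration.

```python
import random

def simulate_estimators(n=5, trials=20000, mu=0.0, sigma=2.0, seed=0):
    """Average s_1^2, s^2 and s over many simulated Normal samples of size n."""
    rng = random.Random(seed)
    total_s1_sq = total_s_sq = total_s = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)
        s1_sq = sum_sq_dev / n        # divides by n (biased)
        s_sq = sum_sq_dev / (n - 1)   # Bessel's correction (unbiased)
        total_s1_sq += s1_sq
        total_s_sq += s_sq
        total_s += s_sq ** 0.5        # sample standard deviation s
    return total_s1_sq / trials, total_s_sq / trials, total_s / trials

mean_s1_sq, mean_s_sq, mean_s = simulate_estimators()
print("average s_1^2:", mean_s1_sq)  # should be near (n-1)/n * sigma^2 = 3.2
print("average s^2:  ", mean_s_sq)   # should be near sigma^2 = 4
print("average s:    ", mean_s)      # should fall somewhat below sigma = 2
```

The averages are only estimates of the expected values, so they will differ slightly from the theoretical figures in the comments and will change with the seed.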

Unfortunately, for estimating the population standard deviation, there is no easy correction as there is for the variance. The size of the necessary correction depends on the distribution of the underlying random variable. For the Normal distribution, there is a complicated exact formula, but simply replacing the $n-1$ in the denominator with $n-1.5$ eliminates most of the bias (with the remaining bias decreasing with increasing sample size). A further adjustment is possible for other distributions and depends on the *excess kurtosis*, a measure of the "heavy-tailedness" of the distribution in excess of that of the Normal distribution.

While the specific corrections are beyond the scope of this post, for the brave, there is an entire Wikipedia article dedicated to exactly this topic.
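For a rough sense of scale in the Normal case only, the sketch below compares the exact factor with the $n-1.5$ rule of thumb. It relies on the standard result (not derived in this post) that ${\Bbb E}[s] = c_4(n)\,\sigma$ for Normal samples, with $c_4(n) = \sqrt{\frac{2}{n-1}}\,\Gamma\!\left(\frac{n}{2}\right) / \Gamma\!\left(\frac{n-1}{2}\right)$. The function names `c4` and `rule_of_thumb` are my own labels.

```python
import math

def c4(n):
    """Exact factor with E[s] = c4(n) * sigma for Normal samples of size n >= 2."""
    # Work with log-gamma to avoid overflow for large n.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

def rule_of_thumb(n):
    """Approximate factor implied by using n - 1.5 instead of n - 1."""
    return math.sqrt((n - 1.5) / (n - 1))

for n in (5, 10, 30, 100):
    print(n, round(c4(n), 4), round(rule_of_thumb(n), 4))
```

Dividing $s$ by either factor gives a corrected estimate of $\sigma$. The two columns move closer together as $n$ grows, which is consistent with the remaining bias shrinking with sample size.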

### Other measures of estimator quality

Zero bias is certainly a desirable quality for a statistic to have, but an estimator's quality depends on more than just its expected value. A statistic's variance tells us how large a spread (around its expected value) we may expect when calculating the statistic from different samples. Just as a statistic with large bias is not particularly helpful, neither is one with no bias but a large variance.
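One standard way to combine the two (a textbook identity, not derived in this post) is the mean squared error. Writing $\hat{\theta}$ for an estimator of a parameter $\theta$ (notation introduced here only for this remark), adding and subtracting ${\Bbb E}[\hat{\theta}]$ inside the square gives

$$
{\Bbb E}\left[ (\hat{\theta} - \theta)^2 \right] = \mathrm{Var}(\hat{\theta}) + \left( {\Bbb E}[\hat{\theta}] - \theta \right)^2
$$

so the expected squared error is the variance plus the squared bias, and a small value requires both to be small.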

The notion of *consistency* ties bias and variance together nicely: a **consistent estimator** is one which *converges in probability* to the population parameter. This means that, as $n \rightarrow \infty$, the probability of an error greater than some specified amount $\epsilon$ approaches zero. A sufficient condition is that both the bias and the variance tend to zero as the sample size grows (by Chebyshev's inequality), although consistency by itself does not strictly require this.

For example, the (weak) law of large numbers implies that $\bar{x}$ is a consistent estimator of $\mu$, as $\lim_{n \rightarrow \infty}{{\Bbb P} \left[ \left| \bar{x} - \mu \right| \geq \epsilon \right]} = 0$ for any $\epsilon > 0$. Furthermore, $s_1^2$ and $s^2$ are both consistent estimators of $\sigma^2$, while $s$ is a consistent estimator of $\sigma$. These examples show that both biased and unbiased estimators can be consistent.
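The consistency of $\bar{x}$ can also be seen empirically. Here is a minimal sketch, assuming samples from the Uniform(0, 1) distribution (so $\mu = 0.5$) and $\epsilon = 0.1$. The distribution, $\epsilon$, the sample sizes, and the function name `tail_probability` are illustrative choices.

```python
import random

def tail_probability(n, epsilon=0.1, trials=2000, seed=0):
    """Estimate P(|x_bar - mu| >= epsilon) for samples of size n from Uniform(0, 1)."""
    rng = random.Random(seed)
    mu = 0.5  # population mean of Uniform(0, 1)
    misses = 0
    for _ in range(trials):
        x_bar = sum(rng.random() for _ in range(n)) / n
        if abs(x_bar - mu) >= epsilon:
            misses += 1
    return misses / trials

for n in (1, 5, 25, 125):
    print(n, tail_probability(n))
```

The estimated probabilities should shrink towards zero as $n$ increases, which is what convergence in probability describes.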

That will do it for Part 1 of this post. Thanks for reading, and look out for Parts 2 and 3, coming up soon. Thanks to Anonymous for the great reader request.

### Sources:

Wikipedia - Unbiased estimation of standard deviation

Wikipedia - Bessel's correction

Quora post - Estimator bias vs. variance