
Proof of the Law of Large Numbers

Preliminaries: Monte Carlo Simulation - Part 2, How do mathematicians model randomness?

In Monte Carlo Simulation - Part 2, I introduced the Weak and Strong Laws of Large Numbers as well as the Central Limit Theorem. While the proof of the latter is a bit more technical, I wanted to include a rather intuitive proof of the (Weak) Law of Large Numbers.

The LLN essentially states that, given a sufficiently large sample size, a random sample mean is unlikely to fall far away from the long-run mean. Thus, it should come as no surprise that the standard proof consists of taking the limit as $n \rightarrow \infty$ of an inequality limiting the likely "spread" of a random variable from its mean.

The Law of Large Numbers states that, for large sample sizes,
sample means will likely converge to the theoretical mean.
(image courtesy of Wikipedia)


Below, I'll prove two random variable inequalities called Markov's Inequality and Chebyshev's Inequality and then show you how the LLN falls right out of the latter when we take a limit.

Markov's Inequality: If $X$ is a non-negative random variable, then the probability that $X$ is at least a positive value $a$ is no greater than the mean of $X$ divided by $a$. In symbols: $$
{\Bbb P}(X \geq a) \leq \frac{{\Bbb E}(X)}{a}
$$ Proof: Let $I$ be the indicator random variable for the event $X \geq a$. In other words, $I$ is a random variable which equals 1 if $X \geq a$ and 0 otherwise. Then it is always the case that $aI \leq X$:

  • If $X<a$, then $I=0$, so $aI=0 \leq X$, since $X$ is non-negative.
  • If $X \geq a$, then $I=1$, so $aI = a \cdot 1 = a \leq X$ since $X \geq a$.
Since $aI \leq X$ is always true, the means must satisfy the inequality ${\Bbb E}(aI) \leq {\Bbb E}(X) \ ( \spadesuit )$. But the left side of this inequality is equal to: 
$$
\begin{align}
{\Bbb E}(aI) &= a {\Bbb E}(I) \\
&= a(1\cdot {\Bbb P}(I=1) + 0 \cdot {\Bbb P}(I=0)) \\
&= a {\Bbb P}(I=1) \\
&= a {\Bbb P}(X \geq a) \tag{$\dagger$}
\end{align}
$$ Substituting $( \dagger )$ into $( \spadesuit )$ gives
$$ a {\Bbb P}(X \geq a) \leq {\Bbb E}(X) $$ Dividing both sides by $a$ yields the desired inequality.
$\square$
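For a concrete feel for the bound (an illustrative example of my own, not part of the proof): if $X$ is exponentially distributed with mean 1, then ${\Bbb P}(X \geq 3) = e^{-3} \approx 0.05$, comfortably below the Markov bound of $\frac{{\Bbb E}(X)}{3} = \frac{1}{3}$. The bound is crude, but it requires nothing beyond non-negativity and a finite mean.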

Chebyshev's Inequality: Let $X$ be a random variable with mean $\mu$ and non-zero, finite variance $\sigma^2$, and let $k > 0$ (the bound is only informative when $k > 1$). Then the probability is at most $\frac{1}{k^2}$ that $X$ falls at least $k$ standard deviations from its mean. In symbols: $$
{\Bbb P}( | X - \mu | \geq k \sigma ) \leq \frac{1}{k^2}
$$ Chebyshev's inequality puts a maximum on how frequently a random variable will exhibit values far away from its mean.

Proof: $|X - \mu| \geq k \sigma$ if and only if $(X-\mu)^2 \geq (k \sigma)^2$. Since $(X-\mu)^2$ is non-negative, we can apply Markov's inequality with $a = (k \sigma)^2$: $$
{\Bbb P}\left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{{\Bbb E}\left[ (X-\mu)^2 \right]}{(k \sigma)^2}
$$ But ${\Bbb E}\left[ (X-\mu)^2 \right]$ is the definition of the variance $\sigma^2$, so $$
{\Bbb P} \left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{\sigma^2}{(k \sigma)^2} = \frac{1}{k^2}
$$ as desired.
$\square$
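To see how conservative this bound is in practice, here is a minimal C++ sketch (my own snippet, separate from the simulation code in Part 2) that draws from an exponential distribution with mean 1, so that $\mu = \sigma = 1$, and compares the observed frequency of $|X - \mu| \geq k\sigma$ to the bound $\frac{1}{k^2}$; the sample size and seed are arbitrary choices:

#include <iostream>
#include <random>
#include <cmath>

int main()
{
    const int N = 1000000;                     // number of draws (arbitrary choice)
    const double mu = 1.0, sigma = 1.0;        // exponential(1) has mean 1 and std dev 1
    std::mt19937 gen(2023);                    // fixed seed for reproducibility
    std::exponential_distribution<double> exp1(1.0);

    for (int k = 1; k <= 4; ++k)
    {
        int count = 0;                         // draws falling at least k std devs from the mean
        for (int i = 0; i < N; ++i)
        {
            double x = exp1(gen);
            if (std::fabs(x - mu) >= k * sigma) { ++count; }
        }
        std::cout << "k = " << k
                  << ", observed frequency = " << static_cast<double>(count) / N
                  << ", Chebyshev bound = " << 1.0 / (k * k) << std::endl;
    }
    return 0;
}

For $k = 2$, for example, the observed frequency should land near ${\Bbb P}(X \geq 3) = e^{-3} \approx 0.05$, well below the bound of $0.25$; Chebyshev's strength is not tightness but the fact that it holds for every distribution with a finite variance.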

Now, we can apply Chebyshev's inequality to the sample mean $\overline{X}_n$ to obtain the Weak Law of Large Numbers. For the below, let $X_1, X_2, X_3, \dotsc$ be iid random variables with mean $\mu$ and finite variance $\sigma^2$.

Proof of the (Weak) Law of Large Numbers: In Monte Carlo Simulation - Part 2, we saw that the expected value of the sample mean is the long-run mean, i.e. ${\Bbb E}(\overline{X}_n) = \mu$. Furthermore, the variance of $\overline{X}_n$ is $$
\begin{align}
{\rm var} \left( \overline{X}_n \right) &= {\rm var} \left(\frac{1}{n} \sum_{i=1}^{n}{X_i} \right) \\[2mm]

&= \frac{1}{n^2} \, {\rm var} \left( \sum_{i=1}^{n}{X_i} \right) \\[2mm]

&= \frac{1}{n^2} \sum_{i=1}^{n}{{\rm var}(X_i)} && \text{(by independence)} \\[2mm]

&= \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}
\end{align}
$$ Now, let $\epsilon > 0$. Then $$
\begin{align}
k \sigma_{\overline{X}_n} = \epsilon
&\iff k^2 {\rm var} \left( \overline{X}_n \right) = \epsilon^2 \\[2mm]

&\iff \frac{1}{k^2} = \frac{{\rm var}\left( \overline{X}_n \right)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2}
\end{align}
$$ Thus, by Chebyshev's inequality, $$
{\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right)
\leq \frac{1}{k^2}
= \frac{\sigma^2}{n \epsilon^2}
$$ Taking the limit of both sides yields $$
\lim_{n \rightarrow \infty}{\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right)
\leq \lim_{n \rightarrow \infty}{\frac{\sigma^2}{n \epsilon^2}} = 0
$$ Finally, since probabilities are non-negative, the limit must be equal to zero.
$\square$
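To get a rough feel for the rate implied by this bound (a numerical illustration of my own): with $\sigma^2 = 1$ and $\epsilon = 0.01$, the inequality gives ${\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq 0.01 \right) \leq \frac{10^4}{n}$, which is vacuous for $n \leq 10{,}000$, drops to 1% at $n = 10^6$, and to 0.01% at $n = 10^8$. The bound is loose (the Central Limit Theorem gives much sharper estimates), but it is all we need to force the limit to zero.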

I hope this post helped clarify what the Law of Large Numbers says and why it's true. Thanks for reading, and feel free to ask any questions in the Comments section.

Monte Carlo Simulation - Part 2

Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 1

In this post, a continuation of the reader-requested Monte Carlo Simulation - Part 1, I will present another application of Monte Carlo methods and a concrete example (the promised $\pi$ approximation), including the C++ source code so that you can see the "guts" of a Monte Carlo simulation.

I will also explain how the $\pi$ example relates to the Law of Large Numbers and Central Limit Theorem, key probabilistic results which apply beyond just Monte Carlo methods.


Monte Carlo approximation of definite integrals (and $\pi$)


In Part 1, I presented a retirement planning asset allocation problem as an example of when one may employ a Monte Carlo method. Given the enormous number of possible scenarios under the many statistical assumptions, this problem was a prime Monte Carlo candidate.

I will now show you another use of Monte Carlo methods: estimating the value of a definite integral $\int_{a}^{b}{f(x) \, dx}$, i.e. the area under a curve $f(x)$ between $x=a$ and $x=b$, when we cannot calculate it analytically (via the antiderivative). This is especially useful for multidimensional integrals, but I will stick to single-dimensional integrals here just to illustrate the method.

For example, suppose we would like to estimate the value of the definite integral $I=\int_{0}^{\pi}{\frac{\sin(x)}{x} \, dx}$. We have a computer choose many random sample points (from the uniform distribution on the rectangle $\left[ 0, \pi \right] \times \left[ 0,1 \right]$) and count how many fall below the curve (the green points in the below image, courtesy of Google):


In the image above, the computer has chosen 314,028 points so far, of which 185,208 were below the curve and 128,820 were above. We thus estimate the value of $I$ as $\frac{\text{185,208}}{\text{185,208}+\text{128,820}} \cdot \pi \approx 1.852854$. This is the proportion of sample points in the rectangle which fell below the curve, multiplied by the total area of the sample rectangle ($1 \cdot \pi$).
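For the curious, here is a minimal C++ sketch of this hit-or-miss estimator for $I$ (my own illustrative snippet, separate from the full $\pi$ program at the end of the post; the sample count and fixed seed are arbitrary choices):

#include <iostream>
#include <random>
#include <cmath>

int main()
{
    const double PI = 3.141592653589793;
    const long long N = 1000000;                          // number of random sample points
    std::mt19937 gen(12345);                              // fixed seed for reproducibility
    std::uniform_real_distribution<double> ux(0.0, PI);   // x uniform on [0, pi]
    std::uniform_real_distribution<double> uy(0.0, 1.0);  // y uniform on [0, 1]

    long long below = 0;                                  // count of points under the curve
    for (long long i = 0; i < N; ++i)
    {
        double x = ux(gen);
        double y = uy(gen);
        double f = (x == 0.0) ? 1.0 : std::sin(x) / x;    // sin(x)/x, using its limit 1 at x = 0
        if (y < f) { ++below; }
    }

    // estimate = (fraction of points below the curve) * (area of the sampling rectangle, pi * 1)
    double I_hat = static_cast<double>(below) / N * PI;
    std::cout << "Estimate of I: " << I_hat << std::endl;
    return 0;
}

Running it should print an estimate near 1.85, and increasing N tightens the estimate, which is exactly the behavior the Law of Large Numbers (discussed below) formalizes.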

To approximate the value of $\pi$, we can use the same technique: we draw a circle of radius 1 centered at $(0,0)$ and choose $N$ random sample points $(x,y)$ from the uniform distribution on the square $\left[ -1, 1 \right] \times \left[ -1,1 \right]$. The total area of the square is 4, and the area of the circle is $\pi \cdot 1^{2} = \pi$. Thus, we would expect the proportion of sample points falling within the circle (the red points in the below animation) to be approximately $\frac{\pi}{4}$.

After running our program, we can count the number $N_1$ of sample points which fell within the circle (a point falls within the circle if its distance from the origin is less than 1, i.e. if $\sqrt{x^2 + y^2}<1$). Our $\pi$ approximation is then $\pi \approx 4 \tfrac{N_1}{N}$.
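For example, if 78,540 of $N = \text{100,000}$ points happened to land inside the circle (a hypothetical count, just for illustration), the estimate would be $4 \cdot \frac{\text{78,540}}{\text{100,000}} = 3.1416$.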

Animation of the Monte Carlo method to estimate $\pi$
(image courtesy of Wikipedia)

The below chart shows three of my simulation runs using the above method, with 100,000 sample points per run ($N= \text{100,000}$). Based on the chart, the approximation seems to generally improve as $N$ increases; while this is not exactly surprising, the Law of Large Numbers, which I will explain in the next section, formalizes this notion: it guarantees that sample averages like this one converge to the true value as $N \rightarrow \infty$. In other words, I didn't just run my simulation 100 times and choose three "lucky" results to show you.

Digging a bit deeper, it appears that $N=\text{10,000}$ would suffice for an approximation within about 0.02-0.03 of the actual value, but if we want precision within 0.01, we need an $N$ somewhere in the neighborhood of 50,000-100,000 samples. The Central Limit Theorem, which I will also address later in this post, tells us exactly how large we need to make $N$ in order to obtain, with a given level of confidence (e.g. 95%), an approximation with a given level of precision (e.g. 0.01).


I have included the C++ code for my simulation at the end of the post so that interested readers can see the details.


The Law of Large Numbers


The Law of Large Numbers formalizes the notion that sample averages converge to the long-run average. There are actually two versions, known as the Weak Law and Strong Law of Large Numbers, which correspond to different types of random variable convergence. Let's take a look at each and interpret them in the context of the $\pi$ simulation.

For the below definitions, let $X_1, X_2, X_3, \dotsc$ be a sequence of random variables, independent from each other, which all have the same probability distribution. These are called independent and identically distributed (iid for short). Since the distributions are identical, the mean of each $X_i$ is the same value, $\mu$, which we will assume is finite. The symbol $\overline{X}_n$ will denote the sample average, defined as $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n}{X_i}$.

The Weak Law of Large Numbers states that, as $n \rightarrow \infty$, the sample average converges in probability to $\mu$, which means that, for any threshold $\epsilon > 0$ (no matter how small), we have $$
\lim_{n \rightarrow \infty} {\Bbb P} \left( \left| \overline{X}_n - \mu \right| > \epsilon \right) = 0
$$ In other words, for any $\epsilon$ specified (e.g. one thousandth, one millionth, etc.), the probability that the sample average will be further than $\epsilon$ from $\mu$ can be made arbitrarily small (but not necessarily zero) by choosing a large enough sample size $n$.

The Strong Law of Large Numbers states that the sample means converge almost surely to $\mu$ (almost surely means "with probability 1"), or in symbols, $$
{\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} = \mu \right) = 1
$$ This is equivalent to the statement $$
{\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} \neq \mu \right) = 0
$$ In other words, the probability of observing a sequence of outcomes for which $\overline{X}_n$ fails to converge to $\mu$ (that is, for which there is some $\epsilon > 0$ such that no $m$ exists with $\left| \overline{X}_n - \mu \right| < \epsilon$ for all $n > m$) is equal to zero. As the name suggests, this is a stronger statement than the Weak Law: it implies the Weak Law.

For the $\pi$ simulation, a single $X_i$ is the outcome of the experiment of choosing a random sample point within the square $\left[ -1, 1 \right] \times \left[ -1, 1 \right]$ and assigning $X_i = 1$ if the point is within the circle of radius 1 or $X_i = 0$ if not. Thus, the expected value of each $X_i$ is $\mu = \pi / 4$, since $$
\begin{align}
{\Bbb E}(X_i) &= 0 \cdot {\Bbb P}(X_i = 0) + 1 \cdot {\Bbb P}(X_i = 1) \\
&= 0 \cdot (1 - \pi / 4) + 1 \cdot (\pi / 4) \\
&= \pi / 4
\end{align}
$$ The Run 1, 2, and 3 lines in the chart above represent the random variable $4 \overline{X}_n$ (with $n$ on the horizontal axis), which has expected value $4 \mu = \pi$, the dotted red line. Both Laws of Large Numbers essentially tell us that it is unlikely for the simulated values to be far from $\pi$ when $n$ grows large enough.

More precisely, take $\epsilon = 0.0001$ as an example error threshold and 99.9% as an example probability threshold. The Weak Law tells us that there exists some $m$ large enough that, if we ran a new Run 4 simulation, there would be a probability of at least 99.9% that its line would remain within a band 0.0001 above or below the dotted red line everywhere to the right of $n=m$.

The Strong Law tells us that there is a probability of exactly 100% that, for some large enough $m$, the Run 4 line would remain within the above-mentioned error band to the right of $n=m$. The Strong Law thus implies the Weak Law; while there is a subtle difference in interpretation, both essentially say that, in some sense, $\overline{X}_n \rightarrow \mu$ as $n \rightarrow \infty$.

Note: for a proof of the Law of Large Numbers using inequalities about the likely "spread" of a random variable around its mean, see this post.

What the Law of Large Numbers does not tell us is how large this "$m$" value needs to be. This is where the Central Limit Theorem comes in.


The Central Limit Theorem


We know that the expected value of a sample mean $\overline{X}_n$ is the population mean $\mu$, since $$
{\Bbb E}(\overline{X}_n)
= {\Bbb E}\left( \frac{\sum_{i=1}^{n}{X_i}}{n} \right)
= \frac{\sum_{i=1}^{n}{{\Bbb E}(X_i)}}{n}
= \frac{n \mu}{n} = \mu
$$ The Law of Large numbers tells us that, with a very high probability, $\overline{X}_n$ will indeed become close to its expected value for large enough values of $n$. The Central Limit Theorem goes a step further and actually gives us the full probability distribution of $\overline{X}_n$, known as the sampling distribution of the mean.

Central Limit Theorem: If $X_1, X_2, X_3, \dotsc$ are iid with mean $\mu$ and finite variance $\sigma^2$, then, as $n \rightarrow \infty$, the standardized quantity $\sqrt{n} \left( \overline{X}_n - \mu \right)$ converges in distribution to a Normal random variable with mean 0 and variance $\sigma^2$. Informally: for large $n$, the distribution of $\overline{X}_n$ is approximately Normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

In this context, convergence in distribution means that the cumulative distribution function of the standardized sample mean converges pointwise to the Normal cumulative distribution function (at every point where the latter is continuous).

What's amazing about the Central Limit Theorem is that the limiting distribution of $\overline{X}_n$ is Normal regardless of the distribution of the $X_i$'s themselves. In the $\pi$ example, the distribution of $X_i$ is obviously not Normal: it's the discrete distribution $$
p(x) = \begin{cases} 1 - \pi / 4 & \text{if $x=0$} \\ \pi / 4 & \text{if $x=1$}
\end{cases}
$$ Nonetheless, the CLT tells us that the sampling distribution of $\overline{X}_n$ will still be approximately Normal for large enough values of $n$. In practice, the rule of thumb is that the Normal approximation is valid when $n$ is at least 30; this "For Dummies" article provides a bit more color on this topic. The below image from Wikipedia shows the sampling distribution of the mean for differing sizes of $n$, using the example of coin flips (the $X_i$'s here are the same as the $\pi$ approximation except that they have "success" probability $1/2$ instead of $\pi/4$):

Sampling distribution of the mean $\frac{1}{n}\sum_{i=1}^{n}{X_i}$ for various values of $n$, where each $X_i$ is 1 for heads and 0 for tails. Notice that the CLT's Normal approximation seems to be most accurate starting around $n=30$.
(image courtesy of Wikipedia)

We can use the CLT-provided Normal approximation to determine how many simulations we need in order to have at least, say, 95% probability of obtaining a $\pi$ estimate within, say, 0.01 of the true value of 3.1415[...]. In order to do this, we need to find the variance of the random variable $4 X_i$. This is given by the computation $$
\begin{align}
\sigma^2_{4 X_i} &= 4^2 \cdot \sigma^2_{X_i} = 16 \cdot {\Bbb E}\left[ (X_i - \mu_{X_i})^2 \right] \\[2mm]

&= 16 \bigg[
\underbrace{\left( 1 - \frac{\pi}{4} \right)}_{p(0)} \left(0 - \frac{\pi}{4} \right)^2
+ \underbrace{\frac{\pi}{4}}_{p(1)} \left( 1 - \frac{\pi}{4} \right)^2
\bigg] \\[2mm]

&= 16 \left[
\left(\frac{\pi}{4} \right)^2 - \left( \frac{\pi}{4} \right)^3 + \frac{\pi}{4} - 2 \left( \frac{\pi}{4} \right)^2 + \left( \frac{\pi}{4} \right)^3
\right] \\[2mm]

&= 16 \left[
\frac{\pi}{4} - \left( \frac{\pi}{4} \right)^2
\right] \\[2mm]

&= 4 \pi \left[ 1 - \frac{\pi}{4} \right] \\[2mm]

&= 4 \pi - \pi^2
\end{align}
$$ The below Excel screenshot shows the math needed to obtain a suitable value of $N$, which turns out to be just over the value of 100,000 which I used for my simulations.
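For reference, the same calculation can be sketched as follows (I am assuming the two-sided 95% critical value $z \approx 1.96$ and the precision target $\epsilon = 0.01$, matching the confidence and precision levels mentioned above): we need the half-width of the 95% interval for the estimate $4 \overline{X}_N$, which the CLT puts at approximately $z \, \sigma_{4X_i} / \sqrt{N}$, to be no larger than $\epsilon$. Solving for $N$ gives $$
N \geq \left( \frac{z \, \sigma_{4X_i}}{\epsilon} \right)^2
= \frac{(1.96)^2 \left( 4\pi - \pi^2 \right)}{(0.01)^2}
\approx 103{,}600
$$ which is indeed just over the 100,000 samples per run that I used.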




$\pi$ approximation source code (C++)


Below is the full source code for my program, as well as the result and the first few lines of the output file "Program Output.txt". Underneath, I have added some commentary about what the different sections do.


#include<iostream>
#include<fstream>
#include<random>
#include<vector>
#include<cmath>
#include<chrono>
using namespace std;

class point //represents a point (x,y) in the plane
{
private:
    double x;
    double y;
public:
    point(double _x, double _y): x(_x), y(_y) {} //constructor
    void assign_vals(double _x, double _y) {x = _x; y = _y;}
    double len(void){ //distance from the point to the origin
        return pow(pow(x,2)+pow(y,2),0.5);
        }
};

// obtain a seed from the system clock:
unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
static std::mt19937 gen(seed);
double rand_num(){
    static std::uniform_real_distribution<double> urd(-1.0,1.0); //uniform random real number between -1 and 1
    return urd(gen); //use gen as a generator
}

int main()
{
    //drop N random points into the square(-1,1)x(-1,1)
    //check if they are within circle of radius 1
    //circle area = pi*(1^2) = pi
    //square area = 2*2 = 4
    //pi estimate = 4 * (# pts in circle / # pts total)
    const int N = 100000; //number of sims
    vector<bool> results(N,0); //vector of N 0's to store intermediate results
    vector<double> pi_approx_vect(N,0); //vector of N 0's to store approximation results
    double x, y, r;
    int sum=0; //sum counter dummy variable
    double pi_approx;
    point * p = new point(0,0); //pointer to point

    ofstream myfile ;
    myfile.open("Program Output.txt", ios::trunc);

    for(int j=1; j<=N; j++){ //j counts samples so far; vector indices are j-1 (0-based)
        x = rand_num(); //random number btw -1.0 and 1.0
        y = rand_num(); //'''
        p->assign_vals(x,y);
        r = p->len();

        if (r < 1){results[j-1] = 1;}
        else {results[j-1] = 0;}

        sum = sum + results[j-1];

        pi_approx_vect[j-1] = 4 * sum / static_cast<double>(j);

        if (myfile.is_open())
        {
        myfile << results[j-1] << endl;
        }

    }

    delete p;
    pi_approx = 4 * sum / static_cast<double>(N);

    cout << "N: " << N << endl;
    cout << "pi approximation: " << pi_approx;

    myfile.close();

    return 0;
}

C++ program output

"Program Output.txt" output

My notes on the program:

  • Lines 1-7: C++ standard libraries included for various purposes
  • Lines 9-20: definition of the "point" class: in the "main" program, the class is instantiated to create a new "point" object $p$, which has members called $x$ and $y$. Since these members are declared using the "private" keyword in the class definition, the "main" program cannot access them directly. Instead, it will populate their values using the constructor (e.g. when they are initialized with 0's in the variable declaration in line 43) or the "assign_vals" function (in line 51). The "len" function returns the distance $\sqrt{x^2+y^2}$ from the point to the origin.
  • Lines 22-28: "seed" the random number generator using the current clock time, which ensures we will get different random numbers each time we run the program. I also define the "rand_num()" function to generate random numbers from the uniform distribution on $[-1,1]$ (one coordinate per call).
  • "main" program, lines 37-43: variable declarations
  • "main" program, lines 48-66: looping from $j=1$ to $N$, populate the point $p$ with random values of $x$ and $y$, and if the radius $r<1$, populate the $j$-th entry of the "results" vector with a 1. If $r \geq 1$, then "results[j]" is populated with a 0 instead.
  • "main" program, line 69: the $\pi$ approximation is calculated as 4 times the sum of the "results" vector (i.e. the number of "successes") divided by $N$. Since both $N$ and "sum" are integer variables, "static_cast<double>" is included to ensure that the quotient is calculated as a decimal number rather than being rounded off to the nearest integer.
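One practical note: the program uses the <random> and <chrono> headers, so it needs to be compiled with C++11 support or later; I compiled it with something along the lines of g++ -std=c++11 -O2 pi_montecarlo.cpp -o pi_montecarlo (the file name is just a placeholder for whatever you save the source as).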
Any questions or constructive criticism on the code are more than welcome in the Comments section; I am fairly new to C++, so some of the above may be unconventional. Note that it is certainly possible to write the same simulation without using classes, but I did it this way to experiment a bit as a learning exercise for myself.

That will do it for this post. Thanks for reading, and thanks to "Anonymous" for the great Reader Request.