## Proof of the Law of Large Numbers

Preliminaries: Monte Carlo Simulation - Part 2, How do mathematicians model randomness?

In Monte Carlo Simulation - Part 2, I introduced the Weak and Strong Laws of Large Numbers as well as the Central Limit Theorem. While the proof of the latter is a bit more technical, I wanted to include a rather intuitive proof of the (Weak) Law of Large Numbers.

The LLN essentially states that, given a sufficiently large sample size, a random sample mean is unlikely to fall far away from the long-run mean. Thus, it should come as no surprise that the standard proof consists of taking the limit as $n \rightarrow \infty$ of an inequality limiting the likely "spread" of a random variable from its mean.

 The Law of Large Numbers states that, for large sample sizes, sample means will likely converge to the theoretical mean. (image courtesy of Wikipedia)

Below, I'll prove two random variable inequalities called Markov's Inequality and Chebyshev's Inequality and then show you how the LLN falls right out of the latter when we take a limit.

Markov's Inequality: If $X$ is a non-negative random variable, then the probability that $X$ exceeds a positive value $a$ is no greater than the mean of $X$ divided by $a$. In symbols: $${\Bbb P}(X \geq a) \leq \frac{{\Bbb E}(X)}{a}$$ Proof: Let $I$ be the indicator random variable for the event $X \geq a$. In other words, $I$ is a random variable which equals 1 if $X \geq a$ and 0 otherwise. Then it is always the case that $aI \leq X$:

• If $X<a$, then $I=0$, so $aI=0 \leq X$, since $X$ is non-negative.
• If $X \geq a$, then $I=1$, so $aI = a \cdot 1 = a \leq X$ since $X \geq a$.
Since $aI \leq X$ is always true, the means must satisfy the inequality ${\Bbb E}(aI) \leq {\Bbb E}(X) \ ( \spadesuit )$. But the left side of this inequality is equal to:
\begin{align} {\Bbb E}(aI) &= a {\Bbb E}(I) \\ &= a(1\cdot {\Bbb P}(I=1) + 0 \cdot {\Bbb P}(I=0)) \\ &= a {\Bbb P}(I=1) \\ &= a {\Bbb P}(X \geq a) \tag{\dagger} \end{align} Substituting $( \dagger )$ into $( \spadesuit )$ gives
$$a {\Bbb P}(X \geq a) \leq {\Bbb E}(X)$$ Dividing both sides by $a$ yields the desired inequality.
$\square$

Chebyshev's Inequality: Let $X$ be a random variable with mean $\mu$ and non-zero, finite variance $\sigma^2$, and let $k \geq 1$. Then the probability is at most $\frac{1}{k^2}$ that $X$ falls at least $k$ standard deviations from its mean. In symbols: $${\Bbb P}( | X - \mu | \geq k \sigma ) \leq \frac{1}{k^2}$$ Chebyshev's inequality puts a maximum on how frequently a random variable will exhibit values far away from its mean.

Proof: $|X - \mu| \geq k \sigma$ if and only if $(X-\mu)^2 \geq (k \sigma)^2$. Since $(X-\mu)^2$ is non-negative, we can apply Markov's inequality with $a = (k \sigma)^2$: $${\Bbb P}\left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{{\Bbb E}\left[ (X-\mu)^2 \right]}{(k \sigma)^2}$$ But ${\Bbb E}\left[ (X-\mu)^2 \right]$ is the definition of the variance $\sigma^2$, so $${\Bbb P} \left[ (X- \mu)^2 \geq (k \sigma)^2 \right] \leq \frac{\sigma^2}{(k \sigma)^2} = \frac{1}{k^2}$$ as desired.
$\square$

Now, we can apply Chebyshev's inequality to the sample mean $\overline{X}_n$ to obtain the Weak Law of Large Numbers. For the below, let $X_1, X_2, X_3, \dotsc$ be iid random variables with mean $\mu$ and finite variance $\sigma^2$.

Proof of the (Weak) Law of Large Numbers: In Monte Carlo Simulation - Part 2, we saw that the expected value of the sample mean is the long-run mean, i.e. ${\Bbb E}(\overline{X}_n) = \mu$. Furthermore, the variance of $\overline{X}_n$ is 
\begin{align}
{\rm var} \left( \overline{X}_n \right) &= {\rm var} \left(\frac{1}{n} \sum_{i=1}^{n}{X_i} \right) \2mm] &= \frac{1}{n^2} \sum_{i=1}^{n}{{\rm var}(X_i)} \\[2mm] &= \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \end{align}  Now, let \epsilon > 0. Then  \begin{align} k \sigma_{\overline{X}_n} = \epsilon &\iff k^2 {\rm var} \left( \overline{X}_n \right) = \epsilon^2 \\[2mm] &\iff \frac{1}{k^2} = \frac{{\rm var}\left( \overline{X}_n \right)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \end{align}  Thus, by Chebyshev's inequality,  {\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right) \leq \frac{1}{k^2} = \frac{\sigma^2}{n \epsilon^2}  Taking the limit of both sides yields  \lim_{n \rightarrow \infty}{\Bbb P}\left( \left| \overline{X}_n - \mu \right| \geq \epsilon \right) \leq \lim_{n \rightarrow \infty}{\frac{\sigma^2}{n \epsilon^2}} = 0  Finally, since probabilities are non-negative, the limit must be equal to zero. \square I hope this post helped clarify what the Law of Large Numbers says and why it's true. Thanks for reading, and feel free to ask any questions in the Comments section. ## Monte Carlo Simulation - Part 2 Preliminaries: How do mathematicians model randomness?, Monte Carlo Simulation - Part 1 In this post, a continuation of the reader-requested Monte Carlo Simulation - Part 1, I will present another application of Monte Carlo methods and a concrete example (the promised \pi approximation), including the C++ source code so that you can see the "guts" of a Monte Carlo simulation. I will also explain how the \pi example relates to the Law of Large Numbers and Central Limit Theorem, key probabilistic results which apply beyond just Monte Carlo methods. ### Monte Carlo approximation of definite integrals (and \pi) In Part 1, I presented a retirement planning asset allocation problem as an example of when one may employ a Monte Carlo method. Given the enormous number of possible scenarios under the many statistical assumptions, this problem was a prime Monte Carlo candidate. I will now show you another use of Monte Carlo methods: estimating the value of a definite integral \int_{a}^{b}{f(x) \, dx}, i.e. the area under a curve f(x) between x=a and x=b, when we cannot calculate it analytically (via the antiderivative). This is especially useful for multidimensional integrals, but I will stick to single-dimensional integrals here just to illustrate the method. For example, suppose we would like to estimate the value of the definite integral I=\int_{0}^{\pi}{\frac{\sin(x)}{x} \, dx}. We have a computer choose many random sample points (from the uniform distribution on the rectangle \left[ 0, \pi \right] \times \left[ 0,1 \right]) and count how many fall below the curve (the green points in the below image, courtesy of Google): In the image above, the computer has chosen 314,028 points so far, of which 185,208 were below the curve and 128,820 were above. We thus estimate the value of I as \frac{{\text 185,208}}{{\text 185,208}+{\text 128,820}} \cdot \pi \approx 1.852854. This is the percentage of sample points in the rectangle which fell below the curve, multiplied by the total area of the sample rectangle (1 \cdot \pi). To approximate the value of \pi, we can use the same technique: we draw a circle of radius 1 centered at (0,0) and choose N random sample points (x,y) from the uniform distribution on the square \left[ -1, 1 \right] \times \left[ -1,1 \right]. The total area of the square is 4, and the area of the circle is \pi \cdot 1^{2} = \pi. Thus, we would expect the percentage of sample points falling within the circle (the red points in the below animation) to be approximately \frac{\pi}{4}. After running our program, we can count the number N_1 of sample points which fell within the circle (a point falls within the circle if its radius is less than 1, i.e. if \sqrt{x^2 + y^2}<1). Our \pi approximation is then \pi \approx 4 \tfrac{N_1}{N}.  Animation of the Monte Carlo method to estimate \pi (image courtesy of Wikipedia) The below chart shows three of my simulation runs using the above method, with 100,000 sample points per run (N= \text{100,000}). Based on the chart, the approximation seems to generally improve as N increases; while this is not exactly surprising, the Law of Large Numbers, which I will explain in the next section, formalizes this notion. The Law of Large Numbers states that approximations like this one will always yield sample averages close to the true value as N \rightarrow \infty. In other words, I didn't just run my simulation 100 times and choose three "lucky" results to show you. Digging a bit deeper, it appears that N=\text{10,000} would suffice for an approximation within about 0.02-0.03 of the actual value, but if we want precision within 0.01, we need an N somewhere in the neighborhood of 50-100,000 samples. The Central Limit Theorem, which I will also address later in this post, tells us exactly how large we need to make N in order to obtain, with a given level of confidence (e.g. 95%), an approximation with a given level of precision (e.g. 0.01). I have included the C++ code for my simulation at the end of the post so that interested readers can see the details. ### The Law of Large Numbers The Law of Large Numbers formalizes the notion that sample averages converge to the long-run average. There are actually two versions, known as the Weak Law and Strong Law of Large Numbers, which correspond to different types of random variable convergence. Let's take a look at each and interpret them in the context of the \pi simulation. For the below definitions, let X_1, X_2, X_3, \dotsc be a sequence of random variables, independent from each other, which all have the same probability distribution. These are called independent and identically distributed (iid for short). Since the distributions are identical, the mean of each X_i is the same value, \mu, which we will assume is finite. The symbol \overline{X}_n will denote the sample average, defined as \frac{\sum_{i=1}^{n}{X_i}}{n} . The Weak Law of Large Numbers states that, as n \rightarrow \infty, the sample average converges in probability to \mu, which means that, for any threshold \epsilon > 0 (no matter how small), we have  \lim_{n \rightarrow \infty} {\Bbb P} \left( \left| \overline{X}_n - \mu \right| > \epsilon \right) = 0  In other words, for any \epsilon specified (e.g. one thousandth, one millionth, etc.), the probability that the sample average will be further than \epsilon from \mu can be made arbitrarily small (but not necessarily zero) by choosing a large enough sample size n. The Strong Law of Large Numbers states that the sample means converge almost surely to \mu (almost surely means "with probability 1"), or in symbols,  {\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} = \mu \right) = 1  This is equivalent to the statement  {\Bbb P} \left( \lim_{n \rightarrow \infty}{\overline{X}_n} \neq \mu \right) = 0  In other words, the probability of a sample outcome in which there is not an m large enough that \overline{X}_n is within \epsilon of \mu whenever n>m, is equal to zero. This is a stronger statement than the Weak Law in that it implies the Weak Law. For the \pi simulation, a single X_i is the outcome of the experiment of choosing a random sample point within the square \left[ -1, 1 \right] \times \left[ -1, 1 \right] and assigning X_i = 1 if the point is within the circle of radius 1 or X_i = 0 if not. Thus, the expected value of each X_i is \mu = \pi / 4, since  \begin{align} {\Bbb E}(X_i) &= 0 \cdot {\Bbb P}(X_i = 0) + 1 \cdot {\Bbb P}(X_i = 1) \\ &= 0 \cdot (1 - \pi / 4) + 1 \cdot (\pi / 4) \\ &= \pi / 4 \end{align}  The Run 1, 2, and 3 lines in the chart above represent the random variable 4 \overline{X}_n (with n on the horizontal axis), and 4X_i has expected value \pi, the dotted red line. Both Laws of Large Numbers essentially tell us that it is unlikely for the simulated values to be too far from \pi when n grows large enough. More precisely, take \epsilon = 0.0001 as an example error threshold and 99.9% as an example probability threshold. The Weak Law tells us that there exists some m large enough that, if we ran a new Run 4 simulation, there would be a probability of at least 99.9% that its line would remain within a band 0.0001 above or below the dotted red line everywhere to the right of n=m. The Strong Law tells us that there is a probability of exactly 100% that, for some large enough m, the Run 4 line would remain within the above-mentioned error band to the right of n=m. Thus, the Strong Law clearly implies the Weak Law, but while there is a subtle difference in interpretation, they both essentially say that, in some sense, \overline{X}_n \rightarrow \mu as n \rightarrow \infty. Note: for a proof of the Law of Large Numbers using inequalities about the likely "spread" of a random variable around its mean, see this post. What the Law of Large Numbers does not tell us is how large this "m" value needs to be. This is where the Central Limit Theorem comes in. ### The Central Limit Theorem We know that the expected value of a sample mean \overline{X}_n is the population mean \mu, since  {\Bbb E}(\overline{X}_n) = {\Bbb E}\left( \frac{\sum_{i=1}^{n}{X_i}}{n} \right) = \frac{\sum_{i=1}^{n}{{\Bbb E}(X_i)}}{n} = \frac{n \mu}{n} = \mu  The Law of Large numbers tells us that, with a very high probability, \overline{X}_n will indeed become close to its expected value for large enough values of n. The Central Limit Theorem goes a step further and actually gives us the full probability distribution of \overline{X}_n, known as the sampling distribution of the mean. Central Limit Theorem: If X_1, X_2, X_3, \dotsc are iid with mean \mu and finite variance \sigma^2, then, as n \rightarrow \infty, the random variable \overline{X}_n converges in distribution to a Normal random variable with mean \mu and variance \frac{\sigma^2}{n}. In this context, convergence in distribution means that the sampling distribution of \overline{X}_n converges pointwise to the applicable Normal distribution. What's amazing about the Central Limit Theorem is that the limiting distribution of \overline{X}_n is Normal regardless of the distribution of the X_i's themselves. In the \pi example, the distribution of X_i is obviously not Normal: it's the discrete distribution  p(x) = \begin{cases} 1 - \pi / 4 & \text{if x=0} \\ \pi / 4 & \text{if x=1} \end{cases}  Nonetheless, the CLT tells us that the sampling distribution of \overline{X}_n will still be approximately Normal for large enough values of n. In practice, the rule of thumb is that the Normal approximation is valid when n is at least 30; this "For Dummies" article provides a bit more color on this topic. The below image from Wikipedia shows the sampling distribution of the mean for differing sizes of n, using the example of coin flips (the X_i's here are the same as the \pi approximation except that they have "success" probability 1/2 instead of \pi/4):  Sampling distribution of the mean \frac{1}{n}\sum_{i=1}^{n}{X_i} for various values of n, where each X_i is 1 for heads and 0 for tails. Notice that the CLT's Normal approximation seems to be most accurate starting around n=30. (image courtesy of Wikipedia) We can use the CLT-provided Normal approximation to determine how many simulations we need in order to have at least, say, 95% probability of obtaining a \pi estimate within, say, 0.01 of the true value of 3.1415[...]. In order to do this, we need to find the variance of the random variable 4 X_i. This is given by the computation  \begin{align} \sigma^2_{4 X_i} &= 4^2 \cdot \sigma^2_{X_i} = 16 \cdot {\Bbb E}\left[ (X_i - \mu_{X_i})^2 \right] \\[2mm] &= 16 \bigg[ \underbrace{\left( 1 - \frac{\pi}{4} \right)}_{p(0)} \left(0 - \frac{\pi}{4} \right)^2 + \underbrace{\frac{\pi}{4}}_{p(1)} \left( 1 - \frac{\pi}{4} \right)^2 \bigg] \\[2mm] &= 16 \left[ \left(\frac{\pi}{4} \right)^2 - \left( \frac{\pi}{4} \right)^3 + \frac{\pi}{4} - 2 \left( \frac{\pi}{4} \right)^2 + \left( \frac{\pi}{4} \right)^3 \right] \\[2mm] &= 16 \left[ \frac{\pi}{4} - \left( \frac{\pi}{4} \right)^2 \right] \\[2mm] &= 4 \pi \left[ 1 - \frac{\pi}{4} \right] \\[2mm] &= 4 \pi - \pi^2 \end{align}  The below Excel screenshot shows the math needed to obtain a suitable value of N, which turns out to be just over the value of 100,000 which I used for my simulations. ### \pi approximation source code (C++) Below is the full source code for my program, as well as the result and the first few lines of the output file "Program Output.txt". Underneath, I have added some commentary about what the different sections do.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 #include #include #include #include #include #include using namespace std; class point //represents a point (x,y) in the plane { private: double x; double y; public: point(double _x, double _y): x(_x), y(_y) {} //constructor void assign_vals(double _x, double _y) {x = _x; y = _y;} double len(void){ //distance from the point to the origin return pow(pow(x,2)+pow(y,2),0.5); } }; // obtain a seed from the system clock: unsigned seed = std::chrono::system_clock::now().time_since_epoch().count(); static std::mt19937 gen(seed); double rand_num(){ static std::uniform_real_distribution urd(-1.0,1.0); //uniform random real number between -1 and 1 return urd(gen); //use gen as a generator } int main() { //drop N random points into the square(-1,1)x(-1,1) //check if they are within circle of radius 1 //circle area = pi*(1^2) = pi //square area = 2*2 = 4 //pi estimate = 4 * (# pts in circle / # pts total) const int N = 100000; //number of sims vector results(N,0); //vector of N 0's to store intermediate results vector pi_approx_vect(N,0); //vector of N 0's to store approximation results double x, y, r; int sum=0; //sum counter dummy variable double pi_approx; point * p = new point(0,0); //pointer to point ofstream myfile ; myfile.open("Program Output.txt", ios::trunc); for(int j=1; j<=results.size(); j++){ x = rand_num(); //random number btw -1.0 and 1.0 y = rand_num(); //''' p->assign_vals(x,y); r = p->len(); if (r < 1){results[j] = 1;} else {results[j] =0;} sum = sum + results[j]; pi_approx_vect[j] = 4 * sum / static_cast(j); if (myfile.is_open()) { myfile << results[j] << endl; } } delete p; pi_approx = 4 * sum / static_cast(N); cout << "N: " << N << endl; cout << "pi approximation: " << pi_approx; myfile.close(); return 0; }   C++ program output  "Program Output.txt" output My notes on the program: • Lines 1-7: C++ standard libraries included for various purposes • Lines 9-20: definition of the "point" class: in the "main" program, the class is instantiated to create a new "point" object p, which has members called x and y. Since these members are declared using the "private" keyword in the class definition, the "main" program cannot access them directly. Instead, it will populate their values using the constructor (e.g. when they are initiated with 0's in the variable declaration in line 43) or the "assign_vals" function (in line 51). The "len" function returns a number representing the "radius", \sqrt{x^2+y^2}, of the point. • Lines 22-28: "seed" the random number generator using the current clock time, which ensures we will get different random numbers each time we run the program. I also define the "rand_num()" function to generate random numbers from the uniform distribution on [-1,1] \times [-1,1]. • "main" program, lines 37-43: variable declarations • "main" program, lines 48-66: looping from j=1 to N, populate the point p with random values of x and y, and if the radius r<1, populate the j-th entry of the "results" vector with a 1. If r \geq 1, then "results[j]" is populated with a 0 instead. • "main" program, line 69: the \pi approximation is calculated as 4 times the sum of the "results" vector (i.e. the number of "successes") divided by N. Since both N and "sum" are integer variables, "static_cast<double>" is included to ensure that the quotient is calculated as a decimal number rather than being rounded off to the nearest integer. Any questions or constructive criticism on the code is more than welcome in the Comments section- I am fairly new to C++, so some of the above may be unconventional. Note that it is certainly possible to write the same simulation without using classes, but I did it this way to experiment a bit as a learning exercise for myself. That will do it for this post. Thanks for reading, and thanks to "Anonymous" for the great Reader Request. ## Monte Carlo Simulation - Part 1 Preliminary: How do mathematicians model randomness? In this two-part post, I will introduce Monte Carlo methods, an umbrella term for techniques involving many random simulations of a phenomenon of interest to obtain a numerical result (or range of results). This post is a response (albeit a late one- my apologies for that) to a reader request from Anonymous dated August 2016. In Part 1, I will explain at a high level what a Monte Carlo simulation is, what kinds of typical inputs and outputs we may expect, and the benefits and limitations of Monte Carlo methods. I use the example of retirement/investment planning software, which Anonymous also mentioned in his request. In Part 2, I will walk through a considerably simpler example in detail: a Monte Carlo method to approximate the value of \pi. I will also explain two well known statistical results, the Law of Large Numbers and the Central Limit Theorem, which justify the use of Monte Carlo methods and allow us to quantify their accuracy. The \pi simulation will directly illustrate the LLN and CLT. ### Monte Carlo methods - high-level overview In the most general terms, a Monte Carlo simulation is a method whereby we simulate one or many random inputs many times (usually thousands or millions), making assumptions as to their probability distributions, and combine the results in a deterministic (i.e. non-random) way to obtain a numerical output. In this section, I will explain in layman's terms why and how we may do this as well as some limitations of these methods. When to employ a Monte Carlo method In the reader request, Anonymous mentioned a typical situation in which an analyst may employ a Monte Carlo method: a wealth manager wishes to build a portfolio of various assets for a client with the goal of providing sufficient income for the client after his retirement. More precisely, for a prospective portfolio to be considered acceptable, it must have a sufficiently high probability of providing enough cash to cover the client's estimated living costs at all times after his retirement date. Suppose we would like to assess the viability of a proposed portfolio consisting of a number of equities (stocks) and fixed-income assets (bonds). We will need to make static assumptions about the client's age, retirement age, initial portfolio size, and post-retirement income and living costs. We will also need to make certain assumptions regarding the probability distributions of interest rates, equity returns, debt yields, dividend rates, inflation, mortality, etc. Note that we may re-categorize some of the "static" assumptions as variable and vice-versa, depending on the goals of the analysis. In any case, with so many variable inputs, it is no simple task to determine the probability that the proposed portfolio will be acceptable at any given time, let alone at all times. The number of random inputs could be enormous, and we do not necessarily have a tractable way of combining them all to arrive at a probability distribution for the portfolio value at some time t, so we can't solve this out analytically. However, we do have clear assumptions for all the inputs, and we have computers. Instead of an analytical solution, we can use a Monte Carlo simulation to arrive at a numerical solution. Typical inputs, outputs, and interpretation I mentioned before that a Monte Carlo simulation consists of many "runs" of a deterministic procedure based on randomized inputs. To perform a single run of a Monte Carlo simulation, we program a computer to simulate each input by drawing random numbers from an assumed probability distribution. In the retirement example, we would randomly generate a sample path over time for each relevant interest rate, stock price, bond price, etc. Given these paths, we can (deterministically!) compute the value of the portfolio at each time, and then it is a matter of arithmetic to determine whether the portfolio covered the living costs at each time or failed. This run represents one of the many possible scenarios which could occur. We repeat this process for, say, 100,000 runs (in Part 2, I will elaborate further on how many runs we need to use), calculating the simulated probability of the portfolio's success as the number of successes divided by 100,000. If that probability exceeds a predetermined threshold (e.g. 90%), then we deem the portfolio "acceptable". A typical software would also provide more sophisticated chart outputs such as the below (which I conveniently found on Google). The first image shows median portfolio values at each time based on certain fixed input parameters, with the median taken over all the simulation runs. The second image shows a "heat map" of success probabilities based on different values of the input parameters. Each pixel on the heat map summarizes a full probability distribution (presumably of portfolio shortfall) for a fixed set of values of the input parameters: a green pixel indicates that the input parameter values (the x- and y-axis values) lead to a portfolio which is likely to cover costs in each year after retirement, while a red pixel indicates that the parameter values lead to a portfolio which is more likely to fail, i.e. to not cover costs (even with previous years' excesses) in at least one year after retirement. The software must make assumptions about the statistical properties of the portfolio in order to generate chart outputs like the above. With respect to the probabilistic inputs (these do not include the deterministic inputs such as retirement age, tax rates, initial portfolio size, etc.), the images show only a mean and standard deviation of the investment returns. Therefore, this software likely assumes Normally distributed returns. It's also worth noting that, for example, Life Expectancy seems to be an input here with a value of 95. This suggests that the simulations may not be randomizing that input. In interpreting this output, it would be important to read the software documentation and understand the major assumptions employed. Benefits and limitations of Monte Carlo methods The most obvious benefit of a Monte Carlo simulation is that it allows us to run millions of complex scenarios in just seconds or minutes, a task which would be impossible without a computer. Another key benefit is a bit more subtle. If we didn't have computers available, we may attempt to answer our retirement question using "what-if scenarios": we would choose a few sets of very simple assumptions about equity and debt returns which allow us to use "back-of-the-envelope" arithmetic to calculate the portfolio value. For example, we may assume fixed, non-random annual returns with three cases as follows: We may even go a step further and assume probability distributions for these returns (e.g. Normal distributions) which allow us to arrive at an analytical solution for the portfolio value's probability distribution over time. This is certainly a reasonable approach; however, these what-if scenarios give us no indication of how likely the three different cases are. A Monte Carlo simulation, on the other hand, gives us a distribution of possible outcomes and the probabilities of those outcomes. Unlikely scenarios will not be over-represented among the simulation runs. Of course, no method is without limitations. On the practical side, without readymade software, we typically need to write our own code to perform a Monte Carlo simulation; this takes time and effort which may not be necessary. For very complex simulations, we may need to worry about the integrity of the random number generator(s) and, in some cases, the speed of the code, though these are beyond the scope of this post. From a more qualitative standpoint, we have the "garbage in, garbage out" principle: a Monte Carlo simulation, like any analysis, is based on certain input assumptions. We must always keep in mind which assumptions went into a simulation when interpreting its results. These assumptions include both input probability distributions and the input parameter values used to calibrate those distributions. For example, the chart outputs above probably assumed "normal" market conditions and Normally distributed returns. The above simulation may therefore underestimate the probability of portfolio failure due to a financial crisis (a non-normal and certainly non-Normal event). Similarly, if life expectancy was indeed a static input, then the analysis may underestimate the probability of failure for people who are very healthy and thus likely to live longer than average. Finally, we must determine "reasonable" values of the input parameters. More often than not, we base these on historical data as our best guess. This implies that the simulation will weight possible scenarios based on their historical relative frequencies, while future frequencies may differ. On the other hand, assuming different parameter values based on our own informed estimates clearly introduces a different bias into the model. In conclusion, Monte Carlo simulation is a powerful tool, but it is still just that: a tool. We must remain critical of our analysis and the assumptions that underlie it. Stay tuned for Part 2, in which we will walk through a simple Monte Carlo method to estimate \pi and introduce the Law of Large Numbers and Central Limit Theorem. ## How do mathematicians model randomness? Preliminaries: set notations In this post, I will lay out the mathematical formalism of probability theory and explain the intuition behind its basic components. This post will begin with the abstract concept of probability spaces, but I'll show you how the formalism is cleverly set up so that, in practice, we only ever need to directly interact with the more tangible notion of probability distributions of random variables. If you've ever studied coin tosses, dice rolls, or bell curves, then you'll see that you're already familiar with probability distributions. After introducing probability distributions, I'll show you a few of the most important ones and wrap up with a simple example of how to use them. For readers not interested in technical details, you might still try skimming the next section as probability spaces are central to the definition of random variables. However, if you're truly averse, you can probably get by without it. ### The probability space In order to model random phenomena, we should have some sort of "driver" of randomness in the model. Mathematicians call this a probability space, and it consists of 3 objects, typically denoted \Omega, \mathcal{F}, and {\Bbb P}. \Omega is called the sample space, and it is a non-empty set whose elements represent some inherently uncertain outcome. For example, an element \omega \in \Omega could be the result of a coin toss or dice roll or, more generally, an overall state of nature or of the world/universe. The sample space is purposely allowed to be vaguely specified, and we will see below how the formalism allows us to not really worry about what \Omega is. This is because we never need to deal with \Omega directly; instead, as mentioned in the introduction, we only interact directly with probability distributions of random variables, both of which will be defined below, so we'll come back to this topic. \mathcal{F} is a \sigma-algebra, a special type of set containing possible events to which we can assign probabilities. The purpose of this post is to explain probability without going into the details of measure theory (which is what you will find in some of the Wikipedia articles), so I am not even going to give a formal definition of a \sigma-algebra here. What you need to know is that the elements of \mathcal{F} are subsets of \Omega and that \mathcal{F} will include any "event" you can reasonably conceive, i.e. anything built using AND's (intersections), OR's (unions), and NOT's (complements) of other known events, including (countably) infinite combinations of these. The only subsets of \Omega which \mathcal{F} will not include are the non-measurable ones, which never arise in practical applications anyhow. Please refer to my axiom of choice post for more color on non-measurable sets (and why you shouldn't lose any sleep over them). Finally, {\Bbb P} is a probability measure, a function which assigns to each event (i.e. to each element of \mathcal{F}) a number between 0 and 1, its probability. In symbols, {\Bbb P}: \mathcal{F} \rightarrow \left[ 0,1 \right]. A probability measure has two properties in addition to returning values in [0,1]: • Countable additivity: given a collection of pairwise disjoint sets A_1, A_2, A_3, \dotsc in \mathcal{F} (pairwise disjoint means none of them overlap, i.e. A_i \cap A_j = \emptyset for all i,j), {\Bbb P}\left( \bigcup_{i=1}^{\infty}{A_i} \right) = \sum_{i=1}^{\infty}{{\Bbb P}(A_i)}. In other words, probabilities add for disjoint events. • {\Bbb P}(\Omega) = 1, i.e. the probability of everything is 1. These properties imply the other familiar probability results such as {\Bbb P}(A^{C}) = 1 - {\Bbb P}(A) and, if A \subseteq B, {\Bbb P}(A) \leq {\Bbb P}(B). With its 3 components, a probability space (\Omega, \mathcal{F}, {\Bbb P}) serves as the "input" for a model of a random phenomenon. In the next section, we will look at the "output". ### Random variables A random variable (r.v.) is a function^{\dagger} which, given the outcome of a random experiment/state of nature/etc. (represented by an element \omega \in \Omega), returns a numerical output of interest. The nomenclature is a bit misleading as a random variable is a function, not a variable. Furthermore, it is not random per se, as the randomness is already captured in the input, \omega. \dagger: technically, a random variable must be a measurable function, a restriction meant to exclude functions such as f(x)=1 if x \in A and f(x)=0 if x \notin A, where A is a non-measurable set. While this is a nice example to test the rigor of the theory, it is clearly not the type of function that would ever arise in a practical application, since we would need to invoke the axiom of choice to define a set such as A. In short, we don't need to worry about the measurability restriction in practical applications. As a concrete example, let \Omega represent the set of possible outcomes of rolling five 6-sided dice, so a sample element would be \omega = (1,4,2,2,6). In this case, \Omega is finite with 6^5 = 7{,}776 elements. Define a function X: \Omega \rightarrow {\Bbb R} where X(\omega) is the sum of the 5 rolls in \omega. Note that X is a deterministic (i.e. not random) function based on the random outcome of the dice rolls. To compute the probability that X=29 (for example), we add up the probabilities of the \omega's which will make X equal 29. The only way to have X=29 is to have one roll result in 5 and the rest result in 6. Thus,  \begin{align} {\Bbb P}(X=29) &= {\Bbb P}\left( \lbrace \omega \in \Omega: X(\omega)=29 \rbrace \right) \\ &= {\Bbb P}(\text{one 5 and four 6's}) \\ &= (5 \cdot (1/6)) \cdot (1/6)^4 \\ &\approx .06\% \end{align}  Note for technical readers: by definition, {\Bbb P} takes elements of \mathcal{F} (i.e. subsets of \Omega) as inputs and returns a number 0 through 1, but we have written "{\Bbb P}(X=29)" above. In this slight abuse of notation, we have implicitly used X to "push {\Bbb P} forward" to a probability measure X_{*}{\Bbb P} (the pushforward measure) which takes sets of output values of X as inputs (as opposed to taking subsets of \Omega, which actually consist of input values for X). Since the interpretation is not ambiguous, it is very common to see an expression such as {\Bbb P}(X=29) written here in place of the more correct X_{*}{\Bbb P}(\lbrace 29 \rbrace). For a more complicated example of a random variable, let X be the function from \Omega to {\Bbb R}^{+} (positive real numbers- thus, we are ignoring finite minimum tick sizes) which returns the price of some stock at a future time t. In this case, an element \omega \in \Omega may represent the history of all trades ever in this stock, which in turn rely on too many phenomena to even imagine. In order to analyze the probability that X takes on a certain value or range of values, we would need to make some assumptions. At the end of this post, we'll see how to do this. ### Probability distribution of a random variable Going back to the example of rolling 5 dice and computing their sum, the possible outputs of X are the integers 5 through 30. Let's call this set of integers R (for the "range" of X). We can define a function f: R \rightarrow [0,1] by f(r) = {\Bbb P}(X=r), whose graph would look like this: Note that f(r) must be between 0 and 1 for each r (i.e. the y-axis starts at 0 and goes up to at most 1) since probabilities are always between 0 and 1. Such a function f is called the probability mass function (pmf), or just probability distribution, of the random variable X. It is sometimes more convenient to use the cumulative mass function (cmf) of X, which is defined by F(r) = \small{\displaystyle \sum_{j \leq r}}{f(j)}. From this formula, we can see that F(r) represents the probability that X takes on a value less than or equal to r; clearly, F is increasing, i.e. F(r_1) \leq F(r_2) when r_1 \leq r_2, and F(30) = 1. f and F contain the same information, so we can choose the one which is most convenient for a particular use. Note also that due to the definitions of f and F as probabilities, it must be the case that \small{\displaystyle \sum_{r \in R}}{f(r)} = 1, which is equivalent to the statement above that F(30) = 1. Note: some authors use the term "probability distribution" to refer to the cumulative mass function. In this post and others, I will explicitly use the word "cumulative" if applicable and will use "probability distribution" to refer to the non-cumulative version. The random variable above is a discrete random variable, meaning it can only take on a countable (in this case, finite) number of values. The stock price example above is a continuous random variable, meaning it can take on an uncountable number of values (in plain English: values from a continuum), in this case {\Bbb R}^{+}. Probability distributions for continuous random variables need to be treated differently since there are so many possible values that the probability of any particular value is effectively zero. Instead, we need to think in terms of ranges of possible values. In this case, we define the probability density function (pdf) f: {\Bbb R}^{+} \rightarrow [0,1] as a function for which  {\Bbb P}(a \leq X \leq b) = \int_{a}^{b}{f(x)dx}  Instead of adding the probabilities of values for which X falls between a and b, we need to integrate, which is the continuous analog of summation. By analogy to distributions of mass in physics, we also use the word "density" for continuously distributed probability to replace "mass" from the discrete case. However, the term probability density function is more general, and so we often use this term (or its abbreviation, pdf) in the discrete case as well. Similarly, the cumulative density function (cdf) F would be defined by  F(x) = {\Bbb P}(X \leq x) = \int_{- \infty}^{x}{f(t)dt}  where, once again, the summation from the discrete formula has been replaced by integration. In the stock price example, the pdf of X may look something like this: As in the discrete case, the probability interpretation of f and F dictates that  \int_{-\infty}^{\infty}{f(x)dx} = 1  which is equivalent to \displaystyle \lim_{x \rightarrow \infty}F(x) = 1. When the underlying sample space \Omega is too complicated to model, as in the stock price example, we instead assume that the random variable representing the outcome of interest follows a certain probability distribution. In this way, we "abstract away" the sample space formalism and obtain a concrete tool to analyze the probabilities of possible events. The next section introduces important characteristics of probability distributions, and below, I will introduce a few of the most important probability distributions. Finally, I will explain how we can quantitatively measure whether an assumed probability distribution is reasonable or not. ### Mean, variance, and other parameters Probability distributions come in different shapes and sizes, but there are a few measures available which help characterize them in a standardized way. The mean of a random variable X, typically denoted \mu (or \mu_X when numerous r.v.'s are involved), is essentially the weighted average of the values X can take, weighted by probability of occurrence. Since the mean is also known as the expected value, it is also denoted {\Bbb E}[X]. If f is the pdf of X, then the mean is calculated as  {\Bbb E}[X] = \sum_{j = -\infty}^{\infty}{jf(j)}  for discrete (real-valued) r.v.'s and  {\Bbb E}[X] = \int_{-\infty}^{\infty}{x f(x) \, dx}  for continuous (real-valued) r.v.'s. Getting back to the physics analogy from which the mass/density terminology originates, the mean is the center of mass of the pdf. This means that if the height of the pdf chart represented a weight located at x along the number line, the mean would be the balancing point. Finally, note that the term "expected value" can be a bit misleading, as {\Bbb E}[X] may be a value which X can't actually take on. For example, the "expected" value of a roll of a die is 3.5. The variance of X, usually denoted \sigma^2, \sigma_X^2, or \text{Var}[X], is the (probability-)weighted average squared distance from the mean. Thus, the formula for variance is  \text{Var}[X] = \sum_{j=-\infty}^{\infty}{(j-{\Bbb E}[X])^2 f(j)}  for discrete r.v.'s and  \text{Var}[X] = \int_{-\infty}^{\infty}{(x-{\Bbb E}[X])^2 f(x) \, dx}  for continuous r.v.'s. Variance is a measure of the "spread" of X: higher variance means that we are more likely to observe values far from the mean. In the mass interpretation, the variance is the moment of inertia: higher variance means it is more difficult to spin the pdf around the mean. Due to the squaring in the above formulas, variance does not have the same units as X but rather those units squared (e.g. if X is in meters, then \sigma^2_X has units of square meters). For this reason, variance can be cumbersome to interpret, so we often use the standard deviation, defined as \sigma_X = \sqrt{\sigma^2_X}, which does have the same units as X. Standard deviation measures the average distance from the mean. Note that in order to compute this distance, it is necessary to find the average squared distance and then take the square root: if we tried to directly calculate the expected value of X - {\Bbb E}[X], we would always get zero due to the definition of expected value (since {\Bbb E}[X] is the "balancing point")! While the mean and variance/standard deviation are useful, they don't capture all the information about a pdf f. We could construct many very different pdf's with the same mean and variance, like the two in the figure below: Other popular measures used to characterize a probability distribution include the median, mode, quartiles, and higher-order moments (i.e. average third, fourth, fifth, etc. powers of distance from the mean). In the interest of length, I won't go into all the different quantitative characteristics of probability distributions, but the take-away point is that in order to fully characterize a random phenomenon, we need more than just the mean and variance. When someone provides us with a mean statistic, we should typically ask for information on the full probability distribution; the variance/standard deviation is a good start, but we should also look at a histogram, which represents sample data from the distribution: is it unimodal (one "hump") or multimodal (many "humps")? Is it symmetric around the mean? Does it have "fat tails" (i.e. are extreme values likely)? This qualitative information is equally as important as the number crunching for proper statistical analysis. ### A few important distributions In this section, I'll show you three important probability distributions which come up frequently in all sorts of applications. The first is the uniform distribution, a very simple continuous probability distribution which basically represents "a random number between a and b". Since probabilities must add up to 1, i.e. the total area under the curve of the pdf must be 1, the pdf of the uniform distribution is f(x) = \frac{1}{b-a} for all values of x between a and b (and zero otherwise)- in other words, a flat horizontal line. The pdf and cdf are pictured below: Next is the binomial distribution with parameters n and p. This discrete distribution represents the number of "successes" in n random, independent experiments, where each experiment has probability of success p and probability of failure 1-p (such experiments are called Bernoulli trials). Since the number of ways to place k successes within n experiments is the binomial coefficient n \choose k, and the probability of any particular such sequence is p^k (1-p)^{n-k}, the pdf is f(k) = {n \choose k} p^k (1-p)^{n-k}. It is relatively simple to prove that the mean and variance of a binomial distribution are np and np(1-p). Below are plots of the pmf and cmf for a few different parameter values: Last but not least is the Normal distribution (also called Gaussian distribution) with mean \mu and variance \sigma^2, the pervasive "bell curve", which has pdf's and cdf's as shown below for a few parameter values: As you can see from the pdf diagram, larger values of \mu move the peak to the right, while larger values of \sigma make the distribution fatter (and thus shorter, as the total area underneath must always be 1). At first glance, the formula for the Normal pdf,  f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}  looks daunting. However, thanks to various favorable properties of the exponential function (especially certain integral formulas involving exponentials in the integrand), calculations with the Normal distribution are actually quite tractable. In addition, this distribution conveniently provides error bands based on the 68-95-99.7 rule: ~68% of data from a Normal distribution falls within 1 standard deviation (1\sigma) of the mean, ~95% of the data falls within 2 standard deviations (2\sigma), and ~99.7% falls within 3 standard deviations (3\sigma): Thus, if we reasonably believe that a certain random variable has a Normal distribution, and we calculate a mean \bar{x} and standard deviation s from sample data, then we can assume that data will fall within 2s of \bar{x} 95% of the time, a very simple yet powerful tool for confidence interval analysis. The Normal distribution applies well to certain natural phenomena such as people's heights, errors in measurements, or the x-coordinate of the position of a particle diffusing in a liquid^{\ddagger}. However, in the next post on Monte Carlo methods, we will see that the bell curve also arises as the distribution of the sample means when we repeat a random experiment many times, a result known as the Central Limit Theorem. \ddagger: The fact that this is Normal is part of the mathematical definition of Brownian motion. Towards the end of the linked post, you will see (1) how the Normal distribution arises in the context of diffusion and (2) how Gaussian integral formulas make Normal distribution math a piece of cake. There are many other well-known probability distributions, but hopefully the three above give you a sense of what they can look like. In the final section of this post, we'll look briefly into how to choose the right distribution to model a random outcome of interest. ### Testing for goodness of fit Above, I mentioned that we can ignore impossibly complex sample spaces by simply assuming a particular probability distribution for a random variable we'd like to study. The question is, which probability distribution(s) can we reasonably assume? While a thorough treatment is beyond the scope of this post, at a high level, the process, known as hypothesis testing, typically proceeds as follows: 1. Hypothesize, i.e. guess, a distribution for the random variable in question. This is called the null hypothesis. Often, the null hypothesis is that the data follows a Normal distribution. 2. Gather sample data and calibrate the distribution using the sample mean, standard deviation, etc. 3. Assuming the random variable follows the (calibrated) hypothesized distribution, calculate how likely it is that hypothetical data from a sample of a certain size would differ from the theoretical distribution by at least as much as your sample data did. This is called a goodness-of-fit test. There are numerous well-known statistical goodness-of-fit tests such as the \chi^2 ("chi-squared") test. 4. The probability calculated in step 3 is called the P-value, and if it is lower than a pre-set threshold (usually 5% or 1%), i.e. if it's very unlikely that the sample data came from the hypothesized distribution, then we conclude that we must reject the null hypothesis. Whenever we base an analysis on the assumption that a random variable follows a certain distribution, we should always use a goodness-of-fit test as a sanity check. In addition, we can visually examine a histogram of the sample data, plotted on top of the null-hypothesized distribution, to gain a qualitative understanding of the results. ### Recap and an example In this post, we saw how the technical underpinnings of probability theory make their way into the definition of random variables, which model outcomes of random phenomena. We saw that a random variable is actually a function, not a variable, and that its input, not its output, drives the randomness in the model. Since the inputs may be far too complicated to describe directly, we can instead assume a certain probability distribution for the output and test our assumption using a statistical goodness-of-fit test. We often test for Normality so that we can take advantage of the 68-95-99.7 rule or other favorable properties to simplify our analysis. As a simple example, suppose you're an investment banker, and you want to lend a client money overnight, taking 100,000 shares of a stock from his portfolio as collateral. If the client fails to repay, you will liquidate the stock tomorrow at the then-prevailing price. If the stock trades at USD 100.00 per share right now, and you lend the client USD 9,800,000.00, is the 2% buffer enough to protect you in the unlikely case the client fails to repay? You download 5 years of historical data and plot the following histogram of the stock's daily % returns: The mean of the sample data is +0.06% and the standard deviation is 0.92%, so you decide to test the null hypothesis that this stock's daily returns are Normally distributed with mean 0.06% and standard deviation 0.92%. You frantically search gtmath.com for how to actually do a \chi^2 test and find that it's not on there; luckily, it's all over Google, and you can even do it in Excel. You run the test and obtain a P-value of 0.42. Since this is not less than 5%, you conclude that the Normality assumption was not unreasonable, and thus even if the client fails to repay, the probability is still less than 2.5% (half of 5% since we only care about downward moves) that the stock drops by more than 1.84% by tomorrow, so you deem the 2% buffer sufficient and go ahead with the loan. In the next post, I will demystify the buzzword "Monte Carlo simulation" and also introduce the Law of Large Numbers and Central Limit Theorem. For now, thanks for reading! ## Functions as Vectors (Part 2): The Dual Space Preliminaries: Functions as Vectors (Part 1), Basis and Dimension In this post, I'll introduce linear operators and functionals and the notion of duality to the vector space toolbox, focusing again on {\Bbb R}^n and the \ell^p spaces from the last post. While this content has numerous practical applications, for the purposes of this post, I'm going to focus on the "pure math" aspect, i.e. the study of the topic for the sake of satisfying intellectual curiosity; however, if there is reader interest, I am considering future posts on applications to physics (e.g. quantum mechanics) once I get through the already-outstanding reader requests. To that end, this post will culminate with the (reader-requested) proof of a theorem about the \ell^p spaces' relationship with their dual spaces. ### Linear Functionals Recall that the \ell^p spaces are vector spaces over {\Bbb R} (or {\Bbb C}, but once again, we will focus on vector spaces over the real numbers). This means that the addition and scaling of vectors are compatible with the properties of real number addition and multiplication (if this is not clear, see this post), so that an expression like a{\bf x} + b{\bf y}, where a,b are numbers/scalars and {\bf x}, {\bf y} are sequences/vectors in \ell^p, is a new vector in \ell^p. We can define functions from a vector space to another vector space (or itself)- these are called operators. Let V and W be vector spaces (over the real numbers), and let T:V \rightarrow W be an operator. So T takes input vectors {\bf v} from V and maps them to output vectors {\bf w} in W. If for all c \in {\Bbb R}, {\bf v}_1, {\bf v}_2 \in V, the following hold:  \begin{align} T({\bf v}_1 + {\bf v}_2) &= T({\bf v}_1) + T({\bf v}_2) \tag{1} \\ T(c{\bf v}_1) &= c \, T({\bf v}_1) \tag{2} \end{align}  then T is called a linear operator. These generalize linear functions on the vector space {\Bbb R} (a vector space in its own right over itself) of the form f(x) = mx. Note that condition (2) disqualifies linear functions of the form f(x) = mx + b where b \neq 0 from being considered linear operators in the vector space sense since then f(0) \neq 0. Now, if W is the underlying field of scalars of V, i.e. W = {\Bbb R} since we're only talking about real vector spaces for now, then T is called a linear functional. While we often use capital letters like T for linear operators between vector spaces, it is common practice to use lower-case Greek letters such as \phi and \psi for linear functionals. ### Bounded/Continuous Linear Functionals In the preliminary post, I mentioned that a norm (measure of the size of a vector) gives us a way to measure the distance between two vectors {\bf x} and {\bf y}. Namely, the distance is the size of the difference vector: d({\bf x}, {\bf y}) = \| {\bf x} - {\bf y} \|, and this formula coincides with the distance formula for the familiar "arrow vectors" of Euclidean space. This notion of distance allows us to define continuity of linear operators similarly to how we define continuity of functions in calculus. In calculus, a function is continuous if, when the input values are close enough to a specific value x_0, the output values are close to f(x_0). In symbols, given a tolerance \epsilon (typically a small positive number), we need to be able to provide a number \delta such that |f(x)-f(x_0)| < \epsilon whenever |x-x_0|<\delta. Here, \delta can depend on both \epsilon and x_0, but if it does not depend on x_0, then f is called uniformly continuous. Similarly, an operator T between normed vector spaces V and W is continuous at {\bf x}_0 \in V if for any \epsilon>0, there exists a \delta>0 such that  \|T({\bf x})-T({\bf x}_0)\|_W < \epsilon  whenever  \|{\bf x} - {\bf x}_0 \|_V < \delta  Separately, an operator T is called bounded if there exists a bound M on how much T "blows up" the size of an input vector. In symbols, T is bounded if there exists an M>0 such that for all {\bf x} \in V,  \|T({\bf x})\|_W \leq M \| {\bf x} \|_V  Note that here, M does not depend on the choice of {\bf x}. Furthermore, the smallest (technically, the least upper bound) M such that this holds is called the operator norm of T and is denoted \| T \|_{\rm op}. Thus, it is always the case that for any {\bf x} \in V, \|T({\bf x})\|_W \leq \|T\|_{\rm op} \|{\bf x}\|_V. We will use this later in the post. It turns out that when T is linear, boundedness and continuity are equivalent. To simplify the notation, I'll focus on the case of linear functionals (i.e. where W = {\Bbb R} and \| \cdot \|_W is absolute value), but the same proof holds for general W's as well. Theorem: For a linear functional \phi on a normed vector space V, the following are equivalent: 1. \phi is bounded. 2. \phi is continuous. 3. \phi is continuous at {\bf 0}. Proof: To prove the 3 statements are equivalent, we will prove that 1 \implies 2, 2 \implies 3, and 3 \implies 1. 1 \implies 2: Since \phi is linear and bounded, there exists an M>0 such that  | \phi({\bf x}) - \phi({\bf y}) | = | \phi( {\bf x} - {\bf y})| \leq M \| {\bf x} - {\bf y} \|  Therefore, for any \epsilon>0, let \delta < \epsilon / M; then the above implies that | \phi({\bf x}) - \phi({\bf y})| < \epsilon whenever \| {\bf x} - {\bf y} \| < \delta. So \phi is continuous. 2 \implies 3: If \phi is continuous everywhere, then in particular, it is continuous at {\bf 0}. 3 \implies 1: Since \phi is linear, \phi({\bf 0})=0. Suppose \phi is continuous at {\bf 0}. Then for \epsilon = 1, there exists a \delta>0 such that |\phi({\bf x})|<1 whenever \|{\bf x}\|<\delta. For any nonzero {\bf x} \in V, define {\bf u} = \tfrac{1}{2} \delta \tfrac{{\bf x}}{\| {\bf x} \|}, so that \| {\bf u} \| = \tfrac{1}{2} \delta. Thus,  1 > |\phi({\bf u})| = \left| \phi \left( \frac{\delta}{2\|{\bf x}\|} {\bf x} \right) \right| = \frac{\delta}{2\|{\bf x}\|} |\phi({\bf x})|  which implies that |\phi({\bf x})|< \tfrac{2}{\delta}\|{\bf x}\|. Thus, \phi is bounded with M=\tfrac{2}{\delta}. \square Since the \delta in the first part of the proof does not depend on {\bf x}, boundedness is actually equivalent to uniform continuity. Now, if V is finite-dimensional, then all linear functionals are bounded/continuous. In an infinite-dimensional space, unbounded/discontinuous operators do exist; however, in many spaces, such as \ell^2, their existence cannot be shown constructively, but rather is proved using the axiom of choice. This means that in practice, any linear functional which you can think up, and which is defined on the entire \ell^2 space (see note below), is bounded. Note: More specifically, when a space contains the limits of all sequences which "should" converge, i.e. those whose points eventually become arbitrarily close together (known as Cauchy sequences), it is called complete. This basically means that it has no "holes". The real numbers and the \ell^p spaces are complete, while the rational numbers are not, since a Cauchy sequence like 3, 3.1, 3.14, 3.141, 3.1415, \dotsc "should" have the limit \pi, but this is not a rational number. In an incomplete space, we can sometimes explicitly construct an unbounded linear operator defined on the entire space, but in a complete space, we need the axiom of choice to prove their existence. However, it is certainly possible to define an unbounded linear operator on an incomplete subspace (called the domain of the operator) of a complete space: the derivative operator, defined on the subset of L^2 consisting of differentiable functions, is linear and unbounded. While many important operators are unbounded, these are trickier to deal with since we always need to keep their domains in mind. ### The Dual Space Just as we showed for function spaces such as the \ell^p spaces, we can define a vector space structure on the set of linear functionals on a vector space V. Vector addition is defined by  (\phi+\psi)({\bf x}) = \phi({\bf x}) + \psi({\bf x})  and scalar multiplication by  (c\phi)({\bf x}) = c \phi({\bf x})  With these operations, the set of all linear functionals on V is a vector space in its own right, usually denoted V^* or V'. This is called the (algebraic) dual space of V. If we exclude the unbounded linear functionals, then we obtain the continuous dual space of V, which avoids the domain issues associated with unbounded operators. For the remainder of this post, I will use the term dual space to refer to the continuous dual space. Thus far, we have seen definitions and properties of linear functionals but have yet to see what they look like in practice. In the finite-dimensional case, suppose we have a vector space V with basis {\bf e}_1, {\bf e}_2, \dotsc, {\bf e}_n. Then for some vector {\bf x} = x_1 {\bf e}_1 + \dotsb + x_n {\bf e}_n and a linear functional \phi, the linearity of \phi implies that \[ \begin{align} \phi({\bf x}) &= \phi( x_1{\bf e}_1 + x_2{\bf e}_2 + \dotsb + x_n{\bf e}_n ) \\[1mm] &= x_1 \phi({\bf e}_1) + x_2 \phi({\bf e}_2) + \dotsb + x_n \phi({\bf e}_n) \\[1mm] &= a_1 x_1 + a_2 x_2 + \dotsb + a_n x_n \tag{\spadesuit} \end{align} where $a_i = \phi({\bf e}_i)$. So knowing the action of $\phi$ on the basis vectors tells us its effect on all other vectors by linearity. Since all linear functionals must have the form above, we can specify a linear functional entirely by the numbers $a_1, a_2, \dotsc, a_n$ which represent its evaluation on the basis vectors.

Now, given a basis $\lbrace {\bf e}_i \rbrace$, we can define a corresponding dual basis, denoted $\lbrace {\bf e}^i \rbrace$, of the dual space where we define ${\bf e}^{j}(x_1 {\bf e}_1 + \dotsb + x_n {\bf e}_n) = x_j$. Note that the superscripts just represent labels, not exponents. The so-called dual basis spans $V^*$ since, by $( \spadesuit )$, any functional $\phi$ above has the representation $\phi = a_1 {\bf e}^1 + \dotsb + a_n {\bf e}^n$. Linear independence is also easy to prove.

Given the dual basis, the dual space $V^*$ is starting to look like the same thing as the original space $V$, just with different labels, i.e. it's starting to look like $V$ and $V^*$ are isomorphic (don't worry, I'll define this word rigorously below). Indeed, this is the case for finite-dimensional spaces. In the next section, I will show you that this isn't exactly the case with the infinite-dimensional $\ell^p$ spaces, but that a similar result holds.

Before we go there, a quick note on etymology: the dual space $V^*$ is called "dual" because it presents us with an alternative way to unambiguously specify a given vector in $V$. The element ${\bf e}^j$ of the dual basis essentially measures the ${\bf e}_j$-component of an input vector. Accordingly, if we know how all the ${\bf e}^j$'s (and thus all functionals) act on a vector ${\bf x}$, then we know the components of ${\bf x}$, i.e. we know which vector ${\bf x}$ is without ambiguity. The following proposition formalizes this idea.

Proposition: Let ${\bf x}, {\bf y} \in V$ be vectors, and assume that for all functionals $\phi \in V^*$, we have $\phi({\bf x})=\phi({\bf y})$. Then ${\bf x}={\bf y}$.

Proof: Define ${\bf z} = {\bf x} - {\bf y}$. Then given any functional $\phi \in V^*$, the assumption above and the linearity of $\phi$ imply that $\phi({\bf z}) = \phi({\bf x}-{\bf y}) = \phi({\bf x})-\phi({\bf y}) = 0 \tag{\clubsuit}$ Assume ${\bf x} \neq {\bf y}$, i.e. ${\bf z} \neq {\bf 0}$. Since ${\bf z}$ is nonzero, $\lbrace {\bf z} \rbrace$ is a linearly independent set, so we can extend it to a basis $B_{\bf z} = \lbrace {\bf z}, {\bf e}_2, \dotsc, {\bf e}_n \rbrace$ for $V$.

For a generic vector ${\bf v} \in V$ with coordinates $(v_1, v_2, \dotsc, v_n)$ in the basis $B_{\bf z}$, define the functional $\phi$ by $\phi({\bf v}) = v_1$ (this is just ${\bf e}^1$ for the basis $B_{\bf z}$). Then $\phi$ is a functional with $\phi({\bf z})=1$, which contradicts $( \clubsuit )$. Therefore, it must be the case that ${\bf z} = 0$ after all.
$\square$

### The Dual of $\ell^p$

By a similar argument (to be formally justified in the proof below) as that presented above for a finite-dimensional vector space, all linear functionals $\phi \in \left( \ell^p \right)^*$ will take the form $\phi({\bf x}) = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dotsb \tag{\spadesuit}$ for a generic vector ${\bf x} = (x_1, x_2, x_3, \dotsc) \in \ell^p$. Since we are talking about the continuous dual space, any such $\phi$ must also be bounded, which by definition means that $|\phi({\bf x})| \leq \| \phi \|_{\rm op} \| {\bf x} \|_p \tag{\heartsuit}$ In other words, we are only dealing with functionals such that the series in $( \spadesuit )$ converges.

The linear functionals ${\bf e}^j$ defined by ${\bf e}^j({\bf x}) = x_j$ once again obviously span the dual space and are linearly independent, so they once again form the standard dual basis, and we can represent the functional $\phi = a_1 {\bf e}^1 + a_2 {\bf e}^2 + a_3 {\bf e}^3 + \dotsb$ by its coordinates in this basis: $\phi = (a_1, a_2, a_3, \dotsc )$.

In this form, an element of the dual space looks like another infinite sequence which, by $( \heartsuit )$, satisfies some sort of convergence condition. Thus, it stands to reason that we may be able to identify linear functionals, i.e. elements of $\left( \ell^p \right)^*$, with elements of one of the spaces $\ell^q$ for a suitable value of $q$.

We are going to make exactly such an identification, and in order to do so, we need to rigorously define what it means to "identify" an element of one vector space with an element of another. Suppose we have vector spaces $V$ and $W$ and a function $T: V \rightarrow W$ which maps elements of $V$ to elements of $W$. If $T$ is one-to-one (also known as a bijection), which means that for each ${\bf w} \in W$, there is one and only one ${\bf v} \in V$ for which $T({\bf v})={\bf w}$, and $T$ preserves vector addition and scalar multiplication, i.e. $T(a{\bf v}_1 + b{\bf v}_2) = aT({\bf v}_1) + bT({\bf v}_2)$ then we call $T$ a (vector space) isomorphism and write $V \cong W$.

The preservation of vector addition and scalar multiplication, the cornerstones of a vector space structure, amounts to nothing more than the definition of linearity, so an isomorphism is actually just a linear operator which is also one-to-one. Furthermore, because of linearity, if a linear operator $T$ maps basis vectors one-to-one to basis vectors, then we can already conclude that $T$ is an isomorphism. Finally, if an isomorphism preserves the norm/metric, i.e. $\|T({\bf v})\|_W = \|{\bf v}\|_V$ for all ${\bf v} \in V$, then it is called an isometry, and $V$ and $W$ are called isometrically isomorphic.

With all the terminology, definitions, and explanation from the preliminary posts and this post out of the way, we can finally state and prove the following theorem which answers Charles Stephens's Reader Request:

Theorem: Let $p$ and $q$ be Hölder conjugates, i.e. $\tfrac{1}{p}+\tfrac{1}{q}=1$, with $1 < p,q < \infty$. Then $\left( \ell^p \right)^*$ is isometrically isomorphic to $\ell^q$, where the norm on $\left( \ell^p \right)^*$ is understood to be the operator norm.

Proof: We need to show the existence of a one-to-one bounded linear operator $T: \ell^q \rightarrow (\ell^p)^*$ which is also an isometry. For a sequence ${\bf x} = (x_1, x_2, x_3, \dotsc) \in \ell^q$, define $T({\bf x})$ to be the functional $\phi_{\bf x} \in (\ell^p)^*$ which maps ${\bf y} = (y_1, y_2, y_3, \dotsc) \in \ell^p$ to the number $\phi_{\bf x}({\bf y}) = \sum_{i=1}^{\infty}{x_i y_i}$ By Hölder's inequality, $|\phi_{\bf x}({\bf y})| \leq \|{\bf x}\|_q \|{\bf y}\|_p$, which implies that

1. the sum specified by $\phi_{\bf x}({\bf y})$ converges, i.e. $\phi_{\bf x}$ is indeed an element of $(\ell^p)^*$, i.e. $T$ is well defined, and
2. $\| \phi_{\bf x} \|_{\rm op} \leq \| {\bf x} \|_q$, so $T$ is bounded.
Furthermore, $T$ is clearly linear from its definition, so $T$ is a well defined, bounded linear operator.

We will show that $T$ is one-to-one by showing that it has an inverse (can you see why being one-to-one is equivalent to having an inverse?). Define the linear operator $U: (\ell^p)^* \rightarrow \ell^q$ by $U({\phi})= (\phi({\bf e}_1), \phi({\bf e}_2), \phi({\bf e}_3), \dotsc)$ For notational simplicity, define $b_j = \phi({\bf e}_j)$, so that $U({\phi})={\bf b}$. Now, define $a_j = \cases{ \frac{|b_j|^q}{b_j} & \text{if } j \leq n \\ 0 & \text{if } j > n }$ where $n$ is (for now) some fixed integer, and $a_j$ is interpreted to be $0$ if $b_j=0$. We want to show that $U$ is the inverse, i.e. "undoes" the action, of $T$, but first we need to show it is well defined and bounded (it is obviously linear by its definition). Since we'll be taking a limit as $n \rightarrow \infty$, we'll assume without loss of generality that $n$ is large enough that at least one of the $b_j$'s is non-zero; we know such an $n$ exists whenever $\phi \neq 0$, while if $\phi=0$, we already know $U(\phi)={\bf 0}$.

First of all, since the sequence ${\bf a} = (a_j)$ terminates at the finite value $n$, ${\bf a}$ is certainly in $\ell^p$. Secondly, using the fact that $q=\tfrac{p}{p-1}$, we can do a few simple algebraic manipulations (I'll skip them here for the sake of brevity) to show that $\| {\bf a} \|_p = \left( \sum_{j=1}^{n}{|b_j|^q} \right)^{1/p} \tag{\dagger}$ Also, \begin{align} \phi({\bf a}) &= \phi \left( \left( \frac{|b_1|^q}{b_1}, \frac{|b_2|^q}{b_2}, \dotsc, \frac{|b_n|^q}{b_n}, 0, 0, 0, \dotsc \right) \right) \\[2mm] &= \phi \left( \frac{|b_1|^q}{b_1} {\bf e}_1 + \frac{|b_2|^q}{b_2} {\bf e}_2 + \dotsb + \frac{|b_n|^q}{b_n} {\bf e}_n \right) \\[2mm] &= \phi \left( \sum_{j=1}^{n}{\frac{|b_j|^q}{b_j} {\bf e}_j} \right) \\[2mm] &=\sum_{j=1}^{n}{ \phi \left( \frac{|b_j|^q}{b_j} {\bf e}_j \right) } \\[2mm] &=\sum_{j=1}^{n}{ \frac{|b_j|^q}{b_j} \phi \left( {\bf e}_j \right) } \\[2mm] &=\sum_{j=1}^{n}{ \frac{|b_j|^q}{b_j} b_j } \\[2mm] &=\sum_{j=1}^{n}{ |b_j|^q } \tag{\ddagger} \end{align} Therefore, \begin{align} \left( \sum_{j=1}^{n}{|b_j|^q} \right)^{1/q} &= \left( \sum_{j=1}^{n}{|b_j|^q} \right)^{1-1/p} \\[2mm] &= \frac{\sum_{j=1}^{n}{|b_j|^q}}{\left( \sum_{j=1}^{n}{|b_j|^q} \right)^{1/p}} \\[2mm] &= \frac{|\phi({\bf a})|}{\|{\bf a}\|_p} \tag{by \dagger, \ddagger}\\[2mm] &\leq \| \phi \|_{\rm op} \end{align} Since this holds for all $n$ (large enough as mentioned above), we can take the limit as $n \rightarrow \infty$ to conclude that $\|{\bf b}\|_q = \| U({\phi}) \|_q \leq \| \phi \|_{\rm op}$. In other words, $U$ is well defined and bounded.

Now it is a piece of cake to show that $U$ and $T$ are inverses: \begin{align} U(T({\bf x})) = U(\phi_{\bf x}) &= (\phi_{\bf x}({\bf e}_1), \phi_{\bf x}({\bf e}_2), \phi_{\bf x}({\bf e}_3), \dotsc) \\[1mm] &= (x_1, x_2, x_3, \dotsc) = {\bf x} \end{align} and \begin{align} T(U(\psi))({\bf y}) = \phi_{U(\psi)}({\bf y}) &= \sum_{i=1}^{\infty}{\psi({\bf e}_i) y_i} \\[1.5mm] &= \sum_{i=1}^{\infty}{\psi(y_i {\bf e}_i)} \\[1.5mm] &= \psi \left( \sum_{i=1}^{\infty}{y_i {\bf e}_i} \right) = \psi({\bf y}) \end{align} i.e. $T(U(\psi))$ is the same functional as $\psi$ since they have the same action on any input vector ${\bf y} \in \ell^p$.

Finally, we showed at the beginning of the proof that for any ${\bf x} \in \ell^q$, we have $\| \phi_{\bf x} \|_{\rm op} = \| T({\bf x}) \|_{\rm op} \leq \|{\bf x}\|_q$.

We also showed, using $(\dagger)$ and $(\ddagger)$, that if ${\bf x} \in \ell^q$ is such that ${\bf x}=U(\phi)$ for some $\phi \in (\ell^p)^*$, then $\| \phi \|_{\rm op} \geq \|{\bf x}\|_q$. But since $U$ and $T$ are inverses, ${\bf x}=U(\phi) \iff \phi = T({\bf x}) = \phi_{\bf x}$, so $\| \phi_{\bf x} \|_{\rm op} = \| T({\bf x}) \|_{\rm op} \geq \|{\bf x}\|_q$. Thus, $\|T({\bf x})\|_{\rm op} = \|{\bf x}\|_q$, which proves that $T$ is an isometry.
$\square$

It's worth noting that the proof above can be slightly modified to prove that $(\ell^1)^* \cong \ell^{\infty}$, the space of all bounded sequences. On the other hand, $(\ell^{\infty})^* \ncong \ell^1$ and is a bit more complicated. However, it is the case that $(c_0)^* \cong \ell^1$, where $c_0$ is the subspace of $\ell^{\infty}$ consisting of all sequences which converge to $0$.

That will conclude this post, and I hope it was informative/enjoyable. Feel free to post any questions in the comments section.