## How do mathematicians model randomness?

*Preliminaries: set notations*

In this post, I will lay out the mathematical formalism of probability theory and explain the intuition behind its basic components. This post will begin with the abstract concept of probability spaces, but I'll show you how the formalism is cleverly set up so that, in practice, we only ever need to directly interact with the more tangible notion of probability distributions of random variables. If you've ever studied coin tosses, dice rolls, or bell curves, then you'll see that you're already familiar with probability distributions.

After introducing probability distributions, I'll show you a few of the most important ones and wrap up with a simple example of how to use them.

If you're not interested in technical details, you might still try skimming the next section, as probability spaces are central to the definition of random variables. However, if you're truly averse, you can probably get by without it.

### The probability space

In order to model random phenomena, we should have some sort of "driver" of randomness in the model. Mathematicians call this a probability space, and it consists of 3 objects, typically denoted $\Omega$, $\mathcal{F}$, and ${\Bbb P}$.

$\Omega$ is called the sample space, and it is a non-empty set whose elements represent some inherently uncertain outcome. For example, an element $\omega \in \Omega$ could be the result of a coin toss or dice roll or, more generally, an overall state of nature or of the world/universe. The sample space is purposely allowed to be vaguely specified, and we will see below how the formalism allows us to not really worry about what $\Omega$ is. This is because we never need to deal with $\Omega$ directly; instead, as mentioned in the introduction, we only interact directly with probability distributions of random variables, both of which will be defined below, so we'll come back to this topic.

$\mathcal{F}$ is a $\sigma$-algebra, a special type of set containing possible events to which we can assign probabilities. The purpose of this post is to explain probability without going into the details of measure theory (which is what you will find in some of the Wikipedia articles), so I am not even going to give a formal definition of a $\sigma$-algebra here. What you need to know is that the elements of $\mathcal{F}$ are subsets of $\Omega$ and that $\mathcal{F}$ will include any "event" you can reasonably conceive, i.e. anything built using AND's (intersections), OR's (unions), and NOT's (complements) of other known events, including (countably) infinite combinations of these. The only subsets of $\Omega$ which $\mathcal{F}$ will not include are the non-measurable ones, which never arise in practical applications anyhow. Please refer to my axiom of choice post for more color on non-measurable sets (and why you shouldn't lose any sleep over them).

Finally, ${\Bbb P}$ is a probability measure, a function which assigns to each event (i.e. to each element of $\mathcal{F}$) a number between 0 and 1, its probability. In symbols, ${\Bbb P}: \mathcal{F} \rightarrow \left[ 0,1 \right]$. A probability measure has two properties in addition to returning values in $[0,1]$:

• Countable additivity: given a collection of pairwise disjoint sets $A_1, A_2, A_3, \dotsc$ in $\mathcal{F}$ (pairwise disjoint means that no two of them overlap, i.e. $A_i \cap A_j = \emptyset$ whenever $i \neq j$), ${\Bbb P}\left( \bigcup_{i=1}^{\infty}{A_i} \right) = \sum_{i=1}^{\infty}{{\Bbb P}(A_i)}$. In other words, probabilities add for disjoint events.
• ${\Bbb P}(\Omega) = 1$, i.e. the probability of everything is 1.

These properties imply the other familiar probability results, such as ${\Bbb P}(A^{C}) = 1 - {\Bbb P}(A)$ and, if $A \subseteq B$, ${\Bbb P}(A) \leq {\Bbb P}(B)$.
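To make these axioms concrete, here's a quick Python sketch (the names `Omega` and `P` are my own, not standard library objects) of the probability space for a single roll of a fair die, checking additivity and the complement rule with exact fractions:

```python
from fractions import Fraction

# A tiny probability space for one roll of a fair 6-sided die:
# Omega is the sample space, and P assigns each event (a subset
# of Omega) the fraction of outcomes it contains.
Omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability measure: uniform over the six outcomes."""
    return Fraction(len(event & Omega), len(Omega))

# P(Omega) = 1: the probability of everything is 1.
assert P(Omega) == 1

# Additivity for disjoint events: {1, 2} and {5} don't overlap.
A, B = {1, 2}, {5}
assert A & B == set()
assert P(A | B) == P(A) + P(B)  # 1/2 == 1/3 + 1/6

# Complement rule: P(not A) = 1 - P(A)
assert P(Omega - A) == 1 - P(A)
```

Of course, real sample spaces are rarely this tame, but every probability measure you'll meet obeys exactly these rules.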

With its 3 components, a probability space $(\Omega, \mathcal{F}, {\Bbb P})$ serves as the "input" for a model of a random phenomenon. In the next section, we will look at the "output".

### Random variables

A random variable (r.v.) is a function$^{\dagger}$ which, given the outcome of a random experiment/state of nature/etc. (represented by an element $\omega \in \Omega$), returns a numerical output of interest. The nomenclature is a bit misleading as a random variable is a function, not a variable. Furthermore, it is not random per se, as the randomness is already captured in the input, $\omega$.

$\dagger$: technically, a random variable must be a measurable function, a restriction meant to exclude functions such as $f(x)=1$ if $x \in A$ and $f(x)=0$ if $x \notin A$, where $A$ is a non-measurable set. While this is a nice example to test the rigor of the theory, it is clearly not the type of function that would ever arise in a practical application, since we would need to invoke the axiom of choice to define a set such as $A$. In short, we don't need to worry about the measurability restriction in practical applications.

As a concrete example, let $\Omega$ represent the set of possible outcomes of rolling five 6-sided dice, so a sample element would be $\omega = (1,4,2,2,6)$. In this case, $\Omega$ is finite with $6^5 = 7{,}776$ elements. Define a function $X: \Omega \rightarrow {\Bbb R}$ where $X(\omega)$ is the sum of the 5 rolls in $\omega$. Note that $X$ is a deterministic (i.e. not random) function based on the random outcome of the dice rolls.

To compute the probability that $X=29$ (for example), we add up the probabilities of the $\omega$'s which will make $X$ equal 29. The only way to have $X=29$ is to have one roll result in 5 and the rest result in 6. Thus, \begin{align} {\Bbb P}(X=29) &= {\Bbb P}\left( \lbrace \omega \in \Omega: X(\omega)=29 \rbrace \right) \\ &= {\Bbb P}(\text{one 5 and four 6's}) \\ &= (5 \cdot (1/6)) \cdot (1/6)^4 \\ &\approx .06\% \end{align} Note for technical readers: by definition, ${\Bbb P}$ takes elements of $\mathcal{F}$ (i.e. subsets of $\Omega$) as inputs and returns a number 0 through 1, but we have written "${\Bbb P}(X=29)$" above. In this slight abuse of notation, we have implicitly used $X$ to "push ${\Bbb P}$ forward" to a probability measure $X_{*}{\Bbb P}$ (the pushforward measure) which takes sets of output values of $X$ as inputs (as opposed to taking subsets of $\Omega$, which actually consist of input values for $X$). Since the interpretation is not ambiguous, it is very common to see an expression such as ${\Bbb P}(X=29)$ written here in place of the more correct $X_{*}{\Bbb P}(\lbrace 29 \rbrace)$.
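If you'd like to verify this by brute force, here's a short Python sketch that enumerates all $6^5$ equally likely outcomes and counts the ones where the sum is 29:

```python
from itertools import product
from fractions import Fraction

# Enumerate the whole sample space of five 6-sided dice and
# compute P(X = 29) directly, where X(omega) is the sum of the rolls.
Omega = list(product(range(1, 7), repeat=5))
assert len(Omega) == 6 ** 5  # 7776 outcomes

favorable = sum(1 for omega in Omega if sum(omega) == 29)
p = Fraction(favorable, len(Omega))

print(favorable)       # 5: the single 5 can land on any of the five dice
print(float(p) * 100)  # ~0.064 (percent)
```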

For a more complicated example of a random variable, let $X$ be the function from $\Omega$ to ${\Bbb R}^{+}$ (positive real numbers; thus, we are ignoring finite minimum tick sizes) which returns the price of some stock at a future time $t$. In this case, an element $\omega \in \Omega$ may represent the history of all trades ever in this stock, which in turn rely on too many phenomena to even imagine. In order to analyze the probability that $X$ takes on a certain value or range of values, we would need to make some assumptions. At the end of this post, we'll see how to do this.

### Probability distribution of a random variable

Going back to the example of rolling 5 dice and computing their sum, the possible outputs of $X$ are the integers 5 through 30. Let's call this set of integers $R$ (for the "range" of $X$). We can define a function $f: R \rightarrow [0,1]$ by $f(r) = {\Bbb P}(X=r)$, whose graph would look like this:

Note that $f(r)$ must be between 0 and 1 for each $r$ (i.e. the $y$-axis starts at 0 and goes up to at most 1) since probabilities are always between 0 and 1. Such a function $f$ is called the probability mass function (pmf), or just probability distribution, of the random variable $X$. It is sometimes more convenient to use the cumulative mass function (cmf) of $X$, which is defined by $F(r) = \sum_{j \leq r}{f(j)}$. From this formula, we can see that $F(r)$ represents the probability that $X$ takes on a value less than or equal to $r$; clearly, $F$ is increasing, i.e. $F(r_1) \leq F(r_2)$ when $r_1 \leq r_2$, and $F(30) = 1$. $f$ and $F$ contain the same information, so we can choose the one which is most convenient for a particular use.

Note also that due to the definitions of $f$ and $F$ as probabilities, it must be the case that $\sum_{r \in R}{f(r)} = 1$, which is equivalent to the statement above that $F(30) = 1$.
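For the computationally inclined, here's a Python sketch (variable names are my own) that builds $f$ and $F$ for the dice-sum example and confirms that the probabilities sum to 1:

```python
from itertools import product
from fractions import Fraction

# Build the full pmf f and cumulative mass function F of the
# dice-sum random variable X from the five-dice example.
Omega = list(product(range(1, 7), repeat=5))
n = len(Omega)

f = {}  # pmf: f[r] = P(X = r)
for omega in Omega:
    r = sum(omega)
    f[r] = f.get(r, 0) + Fraction(1, n)

R = sorted(f)  # possible values of X: the integers 5 through 30
assert R[0] == 5 and R[-1] == 30

# Cumulative mass function: F(r) = sum of f(j) for j <= r
F = {}
running = Fraction(0)
for r in R:
    running += f[r]
    F[r] = running

assert sum(f.values()) == 1  # probabilities sum to 1...
assert F[30] == 1            # ...equivalently, F(30) = 1
```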

Note: some authors use the term "probability distribution" to refer to the cumulative mass function. In this post and others, I will explicitly use the word "cumulative" if applicable and will use "probability distribution" to refer to the non-cumulative version.

The random variable above is a discrete random variable, meaning it can only take on a countable (in this case, finite) number of values. The stock price example above is a continuous random variable, meaning it can take on an uncountable number of values (in plain English: values from a continuum), in this case ${\Bbb R}^{+}$.

Probability distributions for continuous random variables need to be treated differently: there are so many possible values that the probability of $X$ taking any one particular value is zero. Instead, we need to think in terms of ranges of possible values. In this case, we define the probability density function (pdf) $f: {\Bbb R}^{+} \rightarrow [0, \infty)$ as a function for which $${\Bbb P}(a \leq X \leq b) = \int_{a}^{b}{f(x)dx}$$ Instead of adding the probabilities of values for which $X$ falls between $a$ and $b$, we need to integrate, which is the continuous analog of summation. Note that, unlike a probability, the density itself can exceed 1 at some points; it is the areas under the curve which are probabilities. By analogy to distributions of mass in physics, we also use the word "density" for continuously distributed probability to replace "mass" from the discrete case. However, the term probability density function is more general, and so we often use this term (or its abbreviation, pdf) in the discrete case as well.

Similarly, the cumulative distribution function (cdf) $F$ would be defined by $$F(x) = {\Bbb P}(X \leq x) = \int_{- \infty}^{x}{f(t)dt}$$ where, once again, the summation from the discrete formula has been replaced by integration.

In the stock price example, the pdf of $X$ may look something like this:

As in the discrete case, the probability interpretation of $f$ and $F$ dictates that $$\int_{-\infty}^{\infty}{f(x)dx} = 1$$ which is equivalent to $\displaystyle \lim_{x \rightarrow \infty}F(x) = 1$.
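To see these integral relationships in action, here's a Python sketch that integrates a pdf numerically; I'm using the exponential density $f(x) = \lambda e^{-\lambda x}$ as a simple stand-in (not the actual stock price distribution), since it has a closed-form cdf to check against:

```python
import math

# Numerically check that a pdf integrates to 1 and that the cdf
# matches the integral of the density, using the exponential pdf
# f(x) = lam * exp(-lam * x) on x >= 0 as a simple stand-in.
lam = 0.5

def f(x):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func on [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h

# Total area under the pdf is (approximately) 1.
total = integrate(f, 0, 50)
assert abs(total - 1) < 1e-6

# F(2) from numeric integration matches the closed form 1 - exp(-2*lam).
assert abs(integrate(f, 0, 2) - (1 - math.exp(-2 * lam))) < 1e-6

# P(1 <= X <= 3) is just the area under the density between 1 and 3.
print(integrate(f, 1, 3))  # ~0.3834
```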

When the underlying sample space $\Omega$ is too complicated to model, as in the stock price example, we instead assume that the random variable representing the outcome of interest follows a certain probability distribution. In this way, we "abstract away" the sample space formalism and obtain a concrete tool to analyze the probabilities of possible events. The next section covers standard ways to characterize probability distributions, after which I'll present a few of the most important ones. Finally, I'll explain how we can quantitatively assess whether an assumed probability distribution is reasonable.

### Mean, variance, and other parameters

Probability distributions come in different shapes and sizes, but there are a few measures available which help characterize them in a standardized way.

The mean of a random variable $X$, typically denoted $\mu$ (or $\mu_X$ when numerous r.v.'s are involved), is essentially the weighted average of the values $X$ can take, weighted by probability of occurrence. Since the mean is also known as the expected value, it is also denoted ${\Bbb E}[X]$. If $f$ is the pdf of $X$, then the mean is calculated as $${\Bbb E}[X] = \sum_{j = -\infty}^{\infty}{jf(j)}$$ for discrete (real-valued) r.v.'s and $${\Bbb E}[X] = \int_{-\infty}^{\infty}{x f(x) \, dx}$$ for continuous (real-valued) r.v.'s.

Getting back to the physics analogy from which the mass/density terminology originates, the mean is the center of mass of the pdf. This means that if the height of the pdf chart represented a weight located at $x$ along the number line, the mean would be the balancing point. Finally, note that the term "expected value" can be a bit misleading, as ${\Bbb E}[X]$ may be a value which $X$ can't actually take on. For example, the "expected" value of a roll of a die is 3.5.

The variance of $X$, usually denoted $\sigma^2$, $\sigma_X^2$, or $\text{Var}[X]$, is the (probability-)weighted average squared distance from the mean. Thus, the formula for variance is $$\text{Var}[X] = \sum_{j=-\infty}^{\infty}{(j-{\Bbb E}[X])^2 f(j)}$$ for discrete r.v.'s and $$\text{Var}[X] = \int_{-\infty}^{\infty}{(x-{\Bbb E}[X])^2 f(x) \, dx}$$ for continuous r.v.'s. Variance is a measure of the "spread" of $X$: higher variance means that we are more likely to observe values far from the mean. In the mass interpretation, the variance is the moment of inertia: higher variance means it is more difficult to spin the pdf around the mean.

Due to the squaring in the above formulas, variance does not have the same units as $X$ but rather those units squared (e.g. if $X$ is in meters, then $\sigma^2_X$ has units of square meters). For this reason, variance can be cumbersome to interpret, so we often use the standard deviation, defined as $\sigma_X = \sqrt{\sigma^2_X}$, which does have the same units as $X$. Standard deviation measures the typical (root-mean-square) distance from the mean. Note that the order of operations matters: we first average the squared distances and then take the square root. If we instead tried to directly calculate the expected value of $X - {\Bbb E}[X]$, we would always get zero due to the definition of expected value (since ${\Bbb E}[X]$ is the "balancing point")!
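Here's a Python sketch computing the mean, variance, and standard deviation of the dice-sum variable directly from its pmf, including a check of the "balancing point" property just mentioned:

```python
from itertools import product
from fractions import Fraction
import math

# Mean, variance, and standard deviation of the dice-sum variable X,
# computed from its pmf as probability-weighted averages.
Omega = list(product(range(1, 7), repeat=5))
n = len(Omega)

f = {}
for omega in Omega:
    r = sum(omega)
    f[r] = f.get(r, 0) + Fraction(1, n)

mean = sum(r * p for r, p in f.items())
var = sum((r - mean) ** 2 * p for r, p in f.items())
std = math.sqrt(var)

print(mean)  # 35/2, i.e. 17.5: five times the single-die mean of 3.5
print(var)   # 175/12: five times the single-die variance of 35/12
print(std)   # ~3.82

# The "balancing point" property: E[X - E[X]] is exactly zero.
assert sum((r - mean) * p for r, p in f.items()) == 0
```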

While the mean and variance/standard deviation are useful, they don't capture all the information about a pdf $f$. We could construct many very different pdf's with the same mean and variance, like the two in the figure below:

Other popular measures used to characterize a probability distribution include the median, mode, quartiles, and higher-order moments (i.e. average third, fourth, fifth, etc. powers of distance from the mean). In the interest of length, I won't go into all the different quantitative characteristics of probability distributions, but the take-away point is that in order to fully characterize a random phenomenon, we need more than just the mean and variance.

When someone provides us with a mean statistic, we should typically ask for information on the full probability distribution; the variance/standard deviation is a good start, but we should also look at a histogram, which represents sample data from the distribution: is it unimodal (one "hump") or multimodal (many "humps")? Is it symmetric around the mean? Does it have "fat tails" (i.e. are extreme values likely)? This qualitative information is just as important as the number crunching for proper statistical analysis.

### A few important distributions

In this section, I'll show you three important probability distributions which come up frequently in all sorts of applications.

The first is the uniform distribution, a very simple continuous probability distribution which basically represents "a random number between $a$ and $b$". Since probabilities must add up to 1, i.e. the total area under the curve of the pdf must be 1, the pdf of the uniform distribution is $f(x) = \frac{1}{b-a}$ for all values of $x$ between $a$ and $b$ (and zero otherwise); in other words, a flat horizontal line. The pdf and cdf are pictured below:
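A quick Python sketch of the uniform distribution on a hypothetical interval $[a,b] = [2,10]$, where probabilities reduce to interval lengths divided by $b-a$:

```python
# The uniform distribution on [a, b]: flat pdf of height 1/(b - a),
# so probabilities are just interval lengths divided by (b - a).
a, b = 2.0, 10.0

def pdf(x):
    return 1 / (b - a) if a <= x <= b else 0.0

def cdf(x):
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# P(3 <= X <= 5) = interval length / total length = 2/8 = 0.25
assert abs((cdf(5) - cdf(3)) - 0.25) < 1e-12
# The cdf climbs from 0 at a up to 1 at b, so total probability is 1.
assert cdf(a) == 0.0 and cdf(b) == 1.0
```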

Next is the binomial distribution with parameters $n$ and $p$. This discrete distribution represents the number of "successes" in $n$ random, independent experiments, where each experiment has probability of success $p$ and probability of failure $1-p$ (such experiments are called Bernoulli trials). Since the number of ways to place $k$ successes within $n$ experiments is the binomial coefficient $\binom{n}{k}$, and the probability of any particular such sequence is $p^k (1-p)^{n-k}$, the pdf is $f(k) = \binom{n}{k} p^k (1-p)^{n-k}$. It is relatively simple to prove that the mean and variance of a binomial distribution are $np$ and $np(1-p)$. Below are plots of the pmf and cmf for a few different parameter values:
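We can check the pmf formula and the quoted mean and variance with a short Python sketch using exact fractions (the parameter values $n=10$, $p=1/4$ are just an arbitrary choice):

```python
import math
from fractions import Fraction

# Binomial pmf f(k) = C(n, k) p^k (1-p)^(n-k), with exact arithmetic,
# plus a check of the mean np and variance np(1-p) quoted above.
n, p = 10, Fraction(1, 4)

def f(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

pmf = [f(k) for k in range(n + 1)]
assert sum(pmf) == 1  # probabilities sum to 1

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))
assert mean == n * p           # np = 10 * 1/4 = 5/2
assert var == n * p * (1 - p)  # np(1-p) = 10 * 1/4 * 3/4 = 15/8
```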

Last but not least is the Normal distribution (also called Gaussian distribution) with mean $\mu$ and variance $\sigma^2$, the pervasive "bell curve", which has pdf's and cdf's as shown below for a few parameter values:

As you can see from the pdf diagram, larger values of $\mu$ move the peak to the right, while larger values of $\sigma$ make the distribution fatter (and thus shorter, as the total area underneath must always be 1). At first glance, the formula for the Normal pdf, $$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ looks daunting. However, thanks to various favorable properties of the exponential function (especially certain integral formulas involving exponentials in the integrand), calculations with the Normal distribution are actually quite tractable. In addition, this distribution conveniently provides error bands based on the 68-95-99.7 rule: ~68% of data from a Normal distribution falls within 1 standard deviation ($1\sigma$) of the mean, ~95% of the data falls within 2 standard deviations ($2\sigma$), and ~99.7% falls within 3 standard deviations ($3\sigma$):

Thus, if we reasonably believe that a certain random variable has a Normal distribution, and we calculate a mean $\bar{x}$ and standard deviation $s$ from sample data, then we can assume that data will fall within $2s$ of $\bar{x}$ 95% of the time, a very simple yet powerful tool for confidence interval analysis.
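The 68-95-99.7 rule is easy to verify numerically: the Normal cdf can be written in terms of the error function, which Python's standard library provides:

```python
import math

# The Normal cdf in terms of the error function:
# F(x) = (1 + erf((x - mu) / (sigma * sqrt(2)))) / 2
def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Probability of landing within k standard deviations of the mean:
def within(k):
    return normal_cdf(k) - normal_cdf(-k)

print(round(within(1), 4))  # 0.6827  (~68%)
print(round(within(2), 4))  # 0.9545  (~95%)
print(round(within(3), 4))  # 0.9973  (~99.7%)
```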

The Normal distribution applies well to certain natural phenomena such as people's heights, errors in measurements, or the $x$-coordinate of the position of a particle diffusing in a liquid$^{\ddagger}$. However, in the next post on Monte Carlo methods, we will see that the bell curve also arises as the distribution of the sample means when we repeat a random experiment many times, a result known as the Central Limit Theorem.

$\ddagger$: The fact that this is Normal is part of the mathematical definition of Brownian motion. Towards the end of the linked post, you will see (1) how the Normal distribution arises in the context of diffusion and (2) how Gaussian integral formulas make Normal distribution math a piece of cake.

There are many other well-known probability distributions, but hopefully the three above give you a sense of what they can look like. In the final section of this post, we'll look briefly into how to choose the right distribution to model a random outcome of interest.

### Testing for goodness of fit

Above, I mentioned that we can ignore impossibly complex sample spaces by simply assuming a particular probability distribution for a random variable we'd like to study. The question is, which probability distribution(s) can we reasonably assume? While a thorough treatment is beyond the scope of this post, at a high level, the process, known as hypothesis testing, typically proceeds as follows:

1. Hypothesize, i.e. guess, a distribution for the random variable in question. This is called the null hypothesis. Often, the null hypothesis is that the data follows a Normal distribution.
2. Gather sample data and calibrate the distribution using the sample mean, standard deviation, etc.
3. Assuming the random variable follows the (calibrated) hypothesized distribution, calculate how likely it is that hypothetical data from a sample of a certain size would differ from the theoretical distribution by at least as much as your sample data did. This is called a goodness-of-fit test. There are numerous well-known statistical goodness-of-fit tests such as the $\chi^2$ ("chi-squared") test.
4. The probability calculated in step 3 is called the $P$-value, and if it is lower than a pre-set threshold (usually 5% or 1%), i.e. if it's very unlikely that the sample data came from the hypothesized distribution, then we conclude that we must reject the null hypothesis.

Whenever we base an analysis on the assumption that a random variable follows a certain distribution, we should always use a goodness-of-fit test as a sanity check. In addition, we can visually examine a histogram of the sample data, plotted on top of the null-hypothesized distribution, to gain a qualitative understanding of the results.
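As a rough illustration of steps 1 through 4 (a sketch, not a substitute for a proper statistics library), here's a hand-rolled $\chi^2$ test in Python for the null hypothesis that a simulated die is fair; the 5% critical value for 5 degrees of freedom, about 11.07, is a standard tabulated constant:

```python
import random

# Chi-squared goodness-of-fit test for the null hypothesis
# "this die is fair": compare observed counts to the expected
# counts under the null and sum (O - E)^2 / E over the faces.
random.seed(42)
rolls = [random.randint(1, 6) for _ in range(600)]  # "sample data"

observed = [rolls.count(face) for face in range(1, 7)]
expected = len(rolls) / 6  # 100 per face under the null

chi2 = sum((o - expected) ** 2 / expected for o in observed)

# Critical value of the chi-squared distribution with 5 degrees of
# freedom at the 5% level is ~11.07; exceeding it means the sample
# deviates more than a fair die plausibly would, so we reject.
print(chi2, "reject" if chi2 > 11.07 else "fail to reject")
```

A real analysis would report the exact $P$-value rather than just comparing against a critical value, but the mechanics are the same.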

### Recap and an example

In this post, we saw how the technical underpinnings of probability theory make their way into the definition of random variables, which model outcomes of random phenomena. We saw that a random variable is actually a function, not a variable, and that its input, not its output, drives the randomness in the model. Since the inputs may be far too complicated to describe directly, we can instead assume a certain probability distribution for the output and test our assumption using a statistical goodness-of-fit test. We often test for Normality so that we can take advantage of the 68-95-99.7 rule or other favorable properties to simplify our analysis.

As a simple example, suppose you're an investment banker, and you want to lend a client money overnight, taking 100,000 shares of a stock from his portfolio as collateral. If the client fails to repay, you will liquidate the stock tomorrow at the then-prevailing price. If the stock trades at USD 100.00 per share right now, and you lend the client USD 9,800,000.00, is the 2% buffer enough to protect you in the unlikely case the client fails to repay?

You download 5 years of historical data and plot the following histogram of the stock's daily % returns:

The mean of the sample data is +0.06% and the standard deviation is 0.92%, so you decide to test the null hypothesis that this stock's daily returns are Normally distributed with mean 0.06% and standard deviation 0.92%. You frantically search gtmath.com for how to actually do a $\chi^2$ test and find that it's not on there; luckily, it's all over Google, and you can even do it in Excel. You run the test and obtain a $P$-value of 0.42. Since this is not less than 5%, you fail to reject the null hypothesis, i.e. the Normality assumption was not unreasonable. Under that assumption, there is only about a 2.5% chance (half of 5%, since we only care about downward moves) that the stock drops by more than two standard deviations below the mean, i.e. by more than $1.84\% - 0.06\% = 1.78\%$, by tomorrow. You deem the 2% buffer sufficient and go ahead with the loan.
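For what it's worth, under the same Normality assumption we can compute the chance of breaching the full 2% buffer directly (a Python sketch using the figures from the example):

```python
import math

# Under the null hypothesis Normal(mu = 0.06%, sigma = 0.92%) for
# daily returns, estimate the chance the stock drops by more than
# the 2% buffer by tomorrow.
mu, sigma = 0.0006, 0.0092  # mean and std dev of daily returns

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(return < -2%) under the assumed distribution
p_breach = normal_cdf(-0.02, mu, sigma)
print(round(p_breach * 100, 2))  # roughly 1.3% under these assumptions
```

Of course, this estimate is only as good as the Normality assumption itself; real return distributions often have fatter tails than the bell curve suggests.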

In the next post, I will demystify the buzzword "Monte Carlo simulation" and also introduce the Law of Large Numbers and Central Limit Theorem. For now, thanks for reading!