Interpreting Expectations and Medians as Minimizers

I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.

When most people first learn about expectations, they are given a definition such as

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \tag{1}$$

for some random variable $X$ with density $f(x)$ and function $g$. The instructor might motivate the expectation as the long-run average of a random variable over many repeated experiments; in this view, expectations are really just averages.
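As a quick sanity check on this definition, here is a small numerical sketch (the distribution and function are illustrative choices of mine, not from the definition above): the long-run average of $g(X)$ over many simulated draws approaches the integral in Equation (1).

```python
import numpy as np

# Sketch: the long-run average of g(X) over repeated draws approaches
# E[g(X)] = \int g(x) f(x) dx. Here X ~ Normal(0, 1) and g(x) = x^2,
# so E[g(X)] = 1. (Illustrative choices, not from the post.)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print(np.mean(samples ** 2))  # approximately 1.0
```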

However, another interpretation of the expectation of $X$ is that it is the minimizer of a particular loss function, and this interpretation actually generalizes to other properties of $X$’s distribution. The goal of this post is to give a few examples with detailed proofs. I find this interpretation useful because it helps explain why the mean squared error loss is so common.

Expectation and the squared loss

Let $X$ be a square-integrable random variable. Then we claim that

$$\mathbb{E}[X] = \operatorname*{arg\,min}_{a \in \mathbb{R}} \mathbb{E}\big[(X - a)^2\big]. \tag{2}$$

We can prove this with a little clever algebra,

$$\mathbb{E}\big[(X - a)^2\big] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] + 2\,\mathbb{E}\big[(X - \mathbb{E}[X])(\mathbb{E}[X] - a)\big] + (\mathbb{E}[X] - a)^2. \tag{3}$$

The above works because we can add and subtract $\mathbb{E}[X]$,

$$X - a = (X - \mathbb{E}[X]) + (\mathbb{E}[X] - a),$$

and then expand the square.

The first term in Equation (3) does not depend on $a$ and therefore can be ignored in our optimization calculation. Furthermore, note that the cross term is equal to zero. This is because

$$\mathbb{E}\big[(X - \mathbb{E}[X])(\mathbb{E}[X] - a)\big] = (\mathbb{E}[X] - a)\,\mathbb{E}\big[X - \mathbb{E}[X]\big] = (\mathbb{E}[X] - a)\big(\mathbb{E}[X] - \mathbb{E}[X]\big) = 0.$$

If any step is confusing, just recall that $\mathbb{E}[X]$ is nonrandom, and that $\mathbb{E}[c] = c$ for any constant $c$. What this means is that our original optimization problem reduces to

$$\operatorname*{arg\,min}_{a} \mathbb{E}\big[(X - a)^2\big] = \operatorname*{arg\,min}_{a}\, (\mathbb{E}[X] - a)^2.$$

Since $(\mathbb{E}[X] - a)^2$ is a convex function of $a$, $a = \mathbb{E}[X]$ is the minimizer:

$$\operatorname*{arg\,min}_{a}\, (\mathbb{E}[X] - a)^2 = \mathbb{E}[X].$$
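To make the claim concrete, here is a minimal numerical sketch (with an illustrative distribution I chose, not anything from the derivation above): a grid search over $a$ for the Monte Carlo estimate of $\mathbb{E}[(X - a)^2]$ lands essentially on the sample mean.

```python
import numpy as np

# Sketch: approximate argmin_a E[(X - a)^2] with a grid search over a,
# using Monte Carlo samples of X. The distribution is an illustrative
# choice (Exponential with mean 2), not something from the post.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)  # E[X] = 2

grid = np.linspace(0.0, 5.0, 1001)
losses = [np.mean((x - a) ** 2) for a in grid]
a_star = grid[np.argmin(losses)]
print(a_star, x.mean())  # both approximately 2.0
```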

Conditional expectation and the best predictor

Now let’s consider a more complicated example. Consider two square-integrable random variables $X$ and $Y$. Let $\mathcal{F}$ be the class of all square-integrable functions of $X$. What is the best function $f \in \mathcal{F}$ such that the mean squared error $\mathbb{E}[(Y - f(X))^2]$ is minimized, or

$$f^{\star} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(Y - f(X))^2\big]?$$

In words, what is the best predictor of $Y$ given $X$? It turns out, it is the conditional expectation $\mathbb{E}[Y \mid X]$. The derivation is nearly the same as above. We add and subtract $\mathbb{E}[Y \mid X]$, do a little algebra, and show that the cross term goes to zero:

$$\mathbb{E}\big[(Y - f(X))^2\big] = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big] + 2\,\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\big] + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X))^2\big]. \tag{4}$$

Using two properties, the law of total expectation,

$$\mathbb{E}\big[\mathbb{E}[Z \mid X]\big] = \mathbb{E}[Z],$$

and the fact that a function of $X$ can be pulled outside an expectation conditioned on $X$,

$$\mathbb{E}\big[h(X)\, Z \mid X\big] = h(X)\,\mathbb{E}[Z \mid X],$$

it is straightforward to see that the cross term is zero:

$$\begin{aligned}
\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\big]
&= \mathbb{E}\Big[\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X)) \mid X\big]\Big] \\
&\overset{\star}{=} \mathbb{E}\Big[(\mathbb{E}[Y \mid X] - f(X))\,\mathbb{E}\big[Y - \mathbb{E}[Y \mid X] \mid X\big]\Big] \\
&\overset{\dagger}{=} \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X)) \cdot 0\big] \\
&= 0.
\end{aligned}$$

Step $\star$ holds because $\mathbb{E}[Y \mid X] - f(X)$ is a function of $X$ but not $Y$ (intuitively, were it not for the randomness in $X$, this term would be nonrandom), and therefore it is a function of just $X$ and can be pulled out of the conditional expectation. In step $\dagger$, we use

$$\mathbb{E}\big[Y - \mathbb{E}[Y \mid X] \mid X\big] = \mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X] = 0.$$

Once again, the first term in Equation (4) does not depend on $f$, and therefore

$$\operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(Y - f(X))^2\big] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X))^2\big].$$

Again, we have a convex function, this time with a minimum at $f(X) = \mathbb{E}[Y \mid X]$. The first term in Equation (4), $\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]$, has a nice interpretation: given the best predictor $\mathbb{E}[Y \mid X]$, this is the lower bound on our loss. Our remaining loss is a function of how close the conditional expectation is to $f(X)$.
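A small simulation illustrates the point, assuming a toy model $Y = \sin(X) + \varepsilon$ that I chose purely for illustration: the conditional expectation $\mathbb{E}[Y \mid X] = \sin(X)$ achieves a lower mean squared error than a competing square-integrable predictor such as the best linear fit.

```python
import numpy as np

# Sketch: with the toy model Y = sin(X) + noise, the conditional
# expectation is E[Y | X] = sin(X). Its mean squared error beats that of
# another square-integrable predictor, e.g. the best linear fit.
# (The model and numbers are illustrative, not from the post.)
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)

mse_cond = np.mean((y - np.sin(x)) ** 2)      # loss of E[Y | X]
slope, intercept = np.polyfit(x, y, deg=1)    # best linear predictor
mse_linear = np.mean((y - (slope * x + intercept)) ** 2)
print(mse_cond, mse_linear)  # mse_cond (about 0.25) is the smaller of the two
```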

Median and the absolute loss

Finally, let’s look at a different loss function, the absolute value. Let $X$ be a continuous random variable with a Lebesgue density $f$ and CDF $F$. The median of $X$ is the value $m$ such that $F(m) = \mathbb{P}(X \leq m) = 1/2$. In words, it is the value such that a draw from the distribution is equally likely to fall above or below it. However, we can show that $m$ is also the minimizer of the expected absolute loss,

$$m = \operatorname*{arg\,min}_{c} \mathbb{E}\big[\lvert X - c \rvert\big]. \tag{5}$$

This is equivalent to showing that the derivative of the expected absolute loss vanishes exactly at the median,

$$\frac{d}{dc}\,\mathbb{E}\big[\lvert X - c \rvert\big] = 0 \quad \Longleftrightarrow \quad F(c) = \frac{1}{2},$$

since the expected absolute loss is convex in $c$.

Let’s first compute the derivative using Leibniz’s rule for improper integrals,

$$\begin{aligned}
\frac{d}{dc}\,\mathbb{E}\big[\lvert X - c \rvert\big]
&= \frac{d}{dc} \left[ \int_{-\infty}^{c} (c - x)\, f(x)\, dx + \int_{c}^{\infty} (x - c)\, f(x)\, dx \right] \\
&= \int_{-\infty}^{c} f(x)\, dx - \int_{c}^{\infty} f(x)\, dx \\
&= F(c) - \big(1 - F(c)\big) \\
&= 2 F(c) - 1.
\end{aligned}$$

Note that we can move the limit (derivative) inside the integral because we have well-behaved Lebesgue integrals. Setting this derivative equal to $0$, we get

$$2 F(c) - 1 = 0 \quad \Longrightarrow \quad F(c) = \frac{1}{2}.$$

Thus, the minimizer is $c = m$, the median.
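As with the mean, we can check this numerically. The sketch below (with an exponential distribution chosen so that the mean and median differ; an illustrative choice, not from the derivation) finds the grid minimizer of the empirical absolute loss and compares it with the sample median.

```python
import numpy as np

# Sketch: approximate argmin_c E[|X - c|] with a grid search and compare
# it with the sample median. The Exponential(1) distribution is an
# illustrative choice (its median is ln 2 ~ 0.693), not from the post.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)

grid = np.linspace(0.0, 3.0, 3001)
losses = [np.mean(np.abs(x - c)) for c in grid]
c_star = grid[np.argmin(losses)]
print(c_star, np.median(x))  # both approximately 0.693
```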

Conclusion

In my mind, these are beautiful and deep results with interesting implications. For example, while there are a number of good reasons to prefer the squared loss function (differentiability, convexity), the first result above provides another significant reason: since the expectation of the sample mean is the true population mean, the sample mean gives us a straightforward, unbiased estimate of exactly the quantity that minimizes the expected squared loss.

The conditional expectation result shows that for any square-integrable random variables $X$ and $Y$, and over the massive class $\mathcal{F}$ of square-integrable functions of $X$, the best possible predictor of $Y$ is the conditional expectation $\mathbb{E}[Y \mid X]$. This has many implications for other models, such as linear regression.

Finally, I first came across the fact that the median minimizes Equation (5) in a proof that the median is always within one standard deviation of the mean. Writing $\mu = \mathbb{E}[X]$ for the mean, $m$ for the median, and $\sigma$ for the standard deviation,

$$\begin{aligned}
\lvert \mu - m \rvert = \big\lvert \mathbb{E}[X - m] \big\rvert
&\overset{\star}{\leq} \mathbb{E}\big[\lvert X - m \rvert\big] \\
&\overset{\dagger}{\leq} \mathbb{E}\big[\lvert X - \mu \rvert\big] \\
&\overset{\star}{\leq} \sqrt{\mathbb{E}\big[(X - \mu)^2\big]} \\
&= \sigma.
\end{aligned}$$

The two steps labeled $\star$ are due to Jensen’s inequality, since both the absolute-value and square functions are convex. Step $\dagger$ holds because the median is the minimizer of the expected absolute loss.
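For what it’s worth, the inequality is easy to check numerically; the following sketch (with two skewed distributions I picked for illustration) compares $\lvert \bar{x} - \mathrm{median} \rvert$ against the sample standard deviation.

```python
import numpy as np

# Sketch: check |mean - median| <= standard deviation on two skewed
# distributions. The distributions are illustrative choices; the
# inequality holds for any distribution with finite variance.
rng = np.random.default_rng(0)
for name, x in [
    ("exponential", rng.exponential(scale=1.0, size=200_000)),
    ("lognormal", rng.lognormal(mean=0.0, sigma=1.0, size=200_000)),
]:
    gap = abs(x.mean() - np.median(x))
    print(name, gap <= x.std())  # prints True for both
```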