Interpreting Expectations and Medians as Minimizers

I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.

When most people first learn about expectations, they are given a definition such as

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \tag{1}$$

for some random variable $X$ with density $f(x)$ and function $g$. The instructor might motivate the expectation as the long-run average of a random variable over many repeated experiments; in this view, expectations are really just averages.
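As a quick sanity check on this definition, here is a small numerical sketch (the distribution and function are illustrative choices of mine, not from the definition above): the long-run average of $g(X)$ over many simulated draws approaches the integral in Equation (1).

```python
import numpy as np

# Sketch: the long-run average of g(X) over repeated draws approaches
# E[g(X)] = \int g(x) f(x) dx. Here X ~ Normal(0, 1) and g(x) = x^2,
# so E[g(X)] = 1. (Illustrative choices, not from the post.)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print(np.mean(samples ** 2))  # approximately 1.0
```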

However, another interpretation of the expectation of $X$ is that it is the minimizer of a particular loss function, and this interpretation actually generalizes to other properties of $X$’s distribution. The goal of this post is to give a few examples with detailed proofs. I find this interpretation useful because it helps explain why the mean squared error loss is so common.

Expectation and the squared loss

Let $X$ be a square-integrable random variable. Then we claim that

$$\mathbb{E}[X] = \operatorname*{arg\,min}_{a \in \mathbb{R}} \mathbb{E}\big[(X - a)^2\big]. \tag{2}$$

We can prove this with a little clever algebra,

$$\mathbb{E}\big[(X - a)^2\big] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] + 2\,\mathbb{E}\big[(X - \mathbb{E}[X])(\mathbb{E}[X] - a)\big] + (\mathbb{E}[X] - a)^2. \tag{3}$$

The above works because we can add and subtract $\mathbb{E}[X]$,

$$X - a = (X - \mathbb{E}[X]) + (\mathbb{E}[X] - a),$$

and then expand the square.

The first term in Equation (3) does not depend on $a$ and therefore can be ignored in our optimization calculation. Furthermore, note that the cross term is equal to zero. This is because

$$\mathbb{E}\big[(X - \mathbb{E}[X])(\mathbb{E}[X] - a)\big] = (\mathbb{E}[X] - a)\,\mathbb{E}\big[X - \mathbb{E}[X]\big] = (\mathbb{E}[X] - a)\big(\mathbb{E}[X] - \mathbb{E}[X]\big) = 0.$$

If any step is confusing, just recall that $\mathbb{E}[X]$ is nonrandom, and that $\mathbb{E}[c] = c$ for any constant $c$. What this means is that our original optimization problem reduces to

$$\operatorname*{arg\,min}_{a} \mathbb{E}\big[(X - a)^2\big] = \operatorname*{arg\,min}_{a}\, (\mathbb{E}[X] - a)^2.$$

Since $(\mathbb{E}[X] - a)^2$ is a convex function of $a$, $a = \mathbb{E}[X]$ is the minimizer:

$$\operatorname*{arg\,min}_{a}\, (\mathbb{E}[X] - a)^2 = \mathbb{E}[X].$$
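To make the claim concrete, here is a minimal numerical sketch (with an illustrative distribution I chose, not anything from the derivation above): a grid search over $a$ for the Monte Carlo estimate of $\mathbb{E}[(X - a)^2]$ lands essentially on the sample mean.

```python
import numpy as np

# Sketch: approximate argmin_a E[(X - a)^2] with a grid search over a,
# using Monte Carlo samples of X. The distribution is an illustrative
# choice (Exponential with mean 2), not something from the post.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)  # E[X] = 2

grid = np.linspace(0.0, 5.0, 1001)
losses = [np.mean((x - a) ** 2) for a in grid]
a_star = grid[np.argmin(losses)]
print(a_star, x.mean())  # both approximately 2.0
```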

Conditional expectation and the best predictor

Now let’s consider a more complicated example. Consider two square-integrable random variables $X$ and $Y$. Let $\mathcal{F}$ be the class of all square-integrable functions of $X$. What is the best function $f \in \mathcal{F}$ such that the mean squared error $\mathbb{E}[(Y - f(X))^2]$ is minimized, or

$$f^{\star} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(Y - f(X))^2\big]?$$

In words, what is the best predictor of $Y$ given $X$? It turns out, it is the conditional expectation $\mathbb{E}[Y \mid X]$. The derivation is nearly the same as above. We add and subtract $\mathbb{E}[Y \mid X]$, do a little algebra, and show that the cross term goes to zero:

$$\mathbb{E}\big[(Y - f(X))^2\big] = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big] + 2\,\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\big] + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X))^2\big]. \tag{4}$$

Using two properties, the law of total expectation,

$$\mathbb{E}\big[\mathbb{E}[Z \mid X]\big] = \mathbb{E}[Z],$$

and the fact that a function of $X$ can be pulled outside an expectation conditioned on $X$,

$$\mathbb{E}\big[h(X)\, Z \mid X\big] = h(X)\,\mathbb{E}[Z \mid X],$$

it is straightforward to see that the cross term is zero:

$$\begin{aligned}
\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\big]
&= \mathbb{E}\Big[\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X)) \mid X\big]\Big] \\
&\overset{\star}{=} \mathbb{E}\Big[(\mathbb{E}[Y \mid X] - f(X))\,\mathbb{E}\big[Y - \mathbb{E}[Y \mid X] \mid X\big]\Big] \\
&\overset{\dagger}{=} \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X)) \cdot 0\big] \\
&= 0.
\end{aligned}$$

Step $\star$ holds because $\mathbb{E}[Y \mid X] - f(X)$ is a function of $X$ but not $Y$ (intuitively, were it not for the randomness in $X$, this term would be nonrandom), and therefore it is a function of just $X$ and can be pulled out of the conditional expectation. In step $\dagger$, we use

$$\mathbb{E}\big[Y - \mathbb{E}[Y \mid X] \mid X\big] = \mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X] = 0.$$

Once again, the first term in Equation (4) does not depend on $f$, and therefore

$$\operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(Y - f(X))^2\big] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X))^2\big].$$

Again, we have a convex function, this time with a minimum at $f(X) = \mathbb{E}[Y \mid X]$. The first term in Equation (4), $\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]$, has a nice interpretation: given the best predictor $\mathbb{E}[Y \mid X]$, this is the lower bound on our loss. Our remaining loss is a function of how close the conditional expectation is to $f(X)$.
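A small simulation illustrates the point, assuming a toy model $Y = \sin(X) + \varepsilon$ that I chose purely for illustration: the conditional expectation $\mathbb{E}[Y \mid X] = \sin(X)$ achieves a lower mean squared error than a competing square-integrable predictor such as the best linear fit.

```python
import numpy as np

# Sketch: with the toy model Y = sin(X) + noise, the conditional
# expectation is E[Y | X] = sin(X). Its mean squared error beats that of
# another square-integrable predictor, e.g. the best linear fit.
# (The model and numbers are illustrative, not from the post.)
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)

mse_cond = np.mean((y - np.sin(x)) ** 2)      # loss of E[Y | X]
slope, intercept = np.polyfit(x, y, deg=1)    # best linear predictor
mse_linear = np.mean((y - (slope * x + intercept)) ** 2)
print(mse_cond, mse_linear)  # mse_cond (about 0.25) is the smaller of the two
```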

Median and the absolute loss

Finally, let’s look at a different loss function, the absolute value. Let $X$ be a continuous random variable with a Lebesgue density $f$ and CDF $F$. The median of $X$ is the value $m$ such that $F(m) = \mathbb{P}(X \leq m) = 1/2$. In words, it is the value such that a draw from the distribution is equally likely to fall above or below it. However, we can show that $m$ is also the minimizer of the expected absolute loss,

$$m = \operatorname*{arg\,min}_{c} \mathbb{E}\big[\lvert X - c \rvert\big]. \tag{5}$$

This is equivalent to showing that the derivative of the expected absolute loss vanishes exactly at the median,

$$\frac{d}{dc}\,\mathbb{E}\big[\lvert X - c \rvert\big] = 0 \quad \Longleftrightarrow \quad F(c) = \frac{1}{2},$$

since the expected absolute loss is convex in $c$.

Let’s first compute the derivative using Leibniz’s rule for improper integrals,

$$\begin{aligned}
\frac{d}{dc}\,\mathbb{E}\big[\lvert X - c \rvert\big]
&= \frac{d}{dc} \left[ \int_{-\infty}^{c} (c - x)\, f(x)\, dx + \int_{c}^{\infty} (x - c)\, f(x)\, dx \right] \\
&= \int_{-\infty}^{c} f(x)\, dx - \int_{c}^{\infty} f(x)\, dx \\
&= F(c) - \big(1 - F(c)\big) \\
&= 2 F(c) - 1.
\end{aligned}$$

Note that we can move the limit (derivative) inside the integral because we have well-behaved Lebesgue integrals. Setting this derivative equal to $0$, we get

$$2 F(c) - 1 = 0 \quad \Longrightarrow \quad F(c) = \frac{1}{2}.$$

Thus, the minimizer is $c = m$, the median.
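As with the mean, we can check this numerically. The sketch below (with an exponential distribution chosen so that the mean and median differ; an illustrative choice, not from the derivation) finds the grid minimizer of the empirical absolute loss and compares it with the sample median.

```python
import numpy as np

# Sketch: approximate argmin_c E[|X - c|] with a grid search and compare
# it with the sample median. The Exponential(1) distribution is an
# illustrative choice (its median is ln 2 ~ 0.693), not from the post.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)

grid = np.linspace(0.0, 3.0, 3001)
losses = [np.mean(np.abs(x - c)) for c in grid]
c_star = grid[np.argmin(losses)]
print(c_star, np.median(x))  # both approximately 0.693
```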

Conclusion

In my mind, these are beautiful and deep results with interesting implications. For example, while there are a number of good reasons to prefer the squared loss function (differentiability, convexity), the first result above provides another significant reason: since the expectation of the sample mean is the true population mean, the sample mean gives us a straightforward, unbiased estimate of exactly the quantity that minimizes the expected squared loss.

The conditional expectation result shows that for any square-integrable random variables $X$ and $Y$, and over the massive class $\mathcal{F}$ of square-integrable functions of $X$, the best possible predictor of $Y$ is the conditional expectation $\mathbb{E}[Y \mid X]$. This has many implications for other models, such as linear regression.

Finally, I first came across the fact that the median minimizes Equation (5) in a proof that the median is always within one standard deviation of the mean. Writing $\mu = \mathbb{E}[X]$ for the mean, $m$ for the median, and $\sigma$ for the standard deviation,

$$\begin{aligned}
\lvert \mu - m \rvert = \big\lvert \mathbb{E}[X - m] \big\rvert
&\overset{\star}{\leq} \mathbb{E}\big[\lvert X - m \rvert\big] \\
&\overset{\dagger}{\leq} \mathbb{E}\big[\lvert X - \mu \rvert\big] \\
&\overset{\star}{\leq} \sqrt{\mathbb{E}\big[(X - \mu)^2\big]} \\
&= \sigma.
\end{aligned}$$

The two steps labeled $\star$ are due to Jensen’s inequality, since both the absolute-value and square functions are convex. Step $\dagger$ holds because the median is the minimizer of the expected absolute loss.
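For what it’s worth, the inequality is easy to check numerically; the following sketch (with two skewed distributions I picked for illustration) compares $\lvert \bar{x} - \mathrm{median} \rvert$ against the sample standard deviation.

```python
import numpy as np

# Sketch: check |mean - median| <= standard deviation on two skewed
# distributions. The distributions are illustrative choices; the
# inequality holds for any distribution with finite variance.
rng = np.random.default_rng(0)
for name, x in [
    ("exponential", rng.exponential(scale=1.0, size=200_000)),
    ("lognormal", rng.lognormal(mean=0.0, sigma=1.0, size=200_000)),
]:
    gap = abs(x.mean() - np.median(x))
    print(name, gap <= x.std())  # prints True for both
```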