Probabilistic Canonical Correlation Analysis in Detail

Probabilistic canonical correlation analysis (PCCA) reinterprets CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define the model and derive its solution in detail.

Standard CCA

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. For a thorough explanation, please see my previous post.

I will present an abbreviated explanation here for completeness and notation. Let $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$ be two datasets with $n$ samples each and dimensionality $p$ and $q$ respectively. Let $\mathbf{w}_a \in \mathbb{R}^{p}$ and $\mathbf{w}_b \in \mathbb{R}^{q}$ be two linear projections and $\mathbf{z}_a = \mathbf{X}_a \mathbf{w}_a$ and $\mathbf{z}_b = \mathbf{X}_b \mathbf{w}_b$ be a pair of $n$-dimensional “canonical variables”. Then the CCA objective is:

$$
\mathbf{w}_a^{*}, \mathbf{w}_b^{*} = \arg\max_{\mathbf{w}_a, \mathbf{w}_b} \operatorname{corr}(\mathbf{z}_a, \mathbf{z}_b) = \arg\max_{\mathbf{w}_a, \mathbf{w}_b} \operatorname{corr}(\mathbf{X}_a \mathbf{w}_a, \mathbf{X}_b \mathbf{w}_b)
$$

With the constraint that:

$$
\mathbf{z}_a^{\top} \mathbf{z}_a = \mathbf{z}_b^{\top} \mathbf{z}_b = 1.
$$

Since $\operatorname{corr}(\mathbf{z}_a, \mathbf{z}_b) = \cos \theta$ for mean-centered data, where $\theta$ is the angle between the unit-length vectors $\mathbf{z}_a$ and $\mathbf{z}_b$, the objective can be visualized as finding linear projections $\mathbf{w}_a$ and $\mathbf{w}_b$ such that $\mathbf{z}_a$ and $\mathbf{z}_b$ are close to each other on a unit ball in $\mathbb{R}^n$. If we find $k$ such pairs of projections, where $1 \leq k \leq \min(p, q)$, then we can embed our two datasets into $k$-dimensional space.
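
To make this discussion concrete, here is a minimal NumPy sketch of standard CCA. It is only a sketch, not the implementation linked at the end of this post: it assumes mean-centered data matrices `Xa` and `Xb` and recovers only the first canonical pair via an SVD of the whitened cross-covariance matrix.

```python
import numpy as np

def cca_first_pair(Xa, Xb):
    """First pair of canonical directions and their correlation.

    Assumes Xa (n x p) and Xb (n x q) are mean-centered.
    """
    n = Xa.shape[0]
    Saa = Xa.T @ Xa / n
    Sbb = Xb.T @ Xb / n
    Sab = Xa.T @ Xb / n

    def inv_sqrt(S):
        # Inverse matrix square root via an eigendecomposition.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    Ka, Kb = inv_sqrt(Saa), inv_sqrt(Sbb)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, singvals, Vt = np.linalg.svd(Ka @ Sab @ Kb)
    wa = Ka @ U[:, 0]
    wb = Kb @ Vt[0, :]
    return wa, wb, singvals[0]

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
Xa = shared @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(500, 3))
Xb = shared @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(500, 4))
Xa -= Xa.mean(axis=0)
Xb -= Xb.mean(axis=0)

wa, wb, rho = cca_first_pair(Xa, Xb)
print(rho)  # First canonical correlation; close to np.corrcoef(Xa @ wa, Xb @ wb)[0, 1].
```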

Probabilistic interpretation of CCA

A probabilistic interpretation of CCA (PCCA) is one in which our two datasets, $\mathbf{X}_a$ and $\mathbf{X}_b$, are viewed as two sets of observations of two random variables, $\mathbf{x}_a \in \mathbb{R}^{p}$ and $\mathbf{x}_b \in \mathbb{R}^{q}$, that are generated by a shared latent variable $\mathbf{z} \in \mathbb{R}^{k}$. Rather than use linear algebra to set up an objective and then solve for two linear projections $\mathbf{w}_a$ and $\mathbf{w}_b$, we instead write down a model that captures these probabilistic relationships and use maximum likelihood estimation to fit its parameters. See my previous post on probabilistic machine learning if that statement is not clear. The model is:

$$
\begin{aligned}
\mathbf{z} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}_k), \qquad 1 \leq k \leq \min(p, q), \\
\mathbf{x}_a \mid \mathbf{z} &\sim \mathcal{N}(\mathbf{W}_a \mathbf{z} + \boldsymbol{\mu}_a, \boldsymbol{\Psi}_a), \\
\mathbf{x}_b \mid \mathbf{z} &\sim \mathcal{N}(\mathbf{W}_b \mathbf{z} + \boldsymbol{\mu}_b, \boldsymbol{\Psi}_b).
\end{aligned}
$$

Where $\mathbf{W}_a \in \mathbb{R}^{p \times k}$ and $\mathbf{W}_b \in \mathbb{R}^{q \times k}$ are two arbitrary matrices, and $\boldsymbol{\Psi}_a$ and $\boldsymbol{\Psi}_b$ are both positive semi-definite. Bach & Jordan (2005) proved that the resulting maximum likelihood estimates are equivalent, up to rotation and scaling, to CCA.
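
Because PCCA is generative, we can sample paired data directly from the model. Here is a short sketch; the dimensions, loadings, and noise covariances below are arbitrary choices for illustration, with the means set to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, k = 1000, 5, 4, 2

# Arbitrary loadings and positive semi-definite noise covariances.
Wa = rng.normal(size=(p, k))
Wb = rng.normal(size=(q, k))
Aa = rng.normal(size=(p, p))
Ab = rng.normal(size=(q, q))
Psi_a = Aa @ Aa.T + 0.1 * np.eye(p)
Psi_b = Ab @ Ab.T + 0.1 * np.eye(q)

# z ~ N(0, I_k); x_a | z ~ N(Wa z, Psi_a); x_b | z ~ N(Wb z, Psi_b).
Z = rng.normal(size=(n, k))
Xa = Z @ Wa.T + rng.multivariate_normal(np.zeros(p), Psi_a, size=n)
Xb = Z @ Wb.T + rng.multivariate_normal(np.zeros(q), Psi_b, size=n)
```

Each row of `Xa` is paired with the corresponding row of `Xb` because both are generated from the same draw of the latent variable, which is exactly the shared-latent-variable assumption discussed below.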

It is worth being explicit about the differences between this probabilistic framing and the standard framing. In CCA, we take our data and perform matrix multiplications to get lower-dimensional representations $\mathbf{z}_a$ and $\mathbf{z}_b$:

$$
\mathbf{z}_a = \mathbf{X}_a \mathbf{w}_a, \qquad \mathbf{z}_b = \mathbf{X}_b \mathbf{w}_b.
$$

The objective is to find projections $\mathbf{w}_a$ and $\mathbf{w}_b$ such that $\mathbf{z}_a$ and $\mathbf{z}_b$ are maximally correlated.

But the probabilistic model above is a function of random variables. For either $\mathbf{x}_a$ or $\mathbf{x}_b$, we can write the following generative model:

$$
\mathbf{x}_a = \mathbf{W}_a \mathbf{z} + \boldsymbol{\mu}_a + \boldsymbol{\varepsilon}_a, \qquad
\mathbf{x}_b = \mathbf{W}_b \mathbf{z} + \boldsymbol{\mu}_b + \boldsymbol{\varepsilon}_b,
$$

Where $\boldsymbol{\varepsilon}_a \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}_a)$ and $\boldsymbol{\varepsilon}_b \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}_b)$. If we assume our data are mean-centered, meaning $\boldsymbol{\mu}_a = \boldsymbol{\mu}_b = \mathbf{0}$ (we make the same assumption in CCA), and rename $\mathbf{W}_a$ and $\mathbf{W}_b$ to $\boldsymbol{\Lambda}_a$ and $\boldsymbol{\Lambda}_b$ to match the factor analysis notation, we can see that PCCA is just group factor analysis with two groups (Klami et al., 2015):

$$
\begin{aligned}
\mathbf{x}_a &= \boldsymbol{\Lambda}_a \mathbf{z} + \boldsymbol{\varepsilon}_a, \qquad \boldsymbol{\varepsilon}_a \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}_a), \\
\mathbf{x}_b &= \boldsymbol{\Lambda}_b \mathbf{z} + \boldsymbol{\varepsilon}_b, \qquad \boldsymbol{\varepsilon}_b \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}_b).
\end{aligned}
$$

And this in turn is nice because we can just use the maximum likelihood estimates we computed for factor analysis for PCCA. To see this, let’s use block matrices to represent our data and parameters:

$$
\mathbf{x} = \begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix}, \qquad
\boldsymbol{\Lambda} = \begin{bmatrix} \boldsymbol{\Lambda}_a \\ \boldsymbol{\Lambda}_b \end{bmatrix}, \qquad
\boldsymbol{\Psi} = \begin{bmatrix} \boldsymbol{\Psi}_a & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Psi}_b \end{bmatrix}
$$

Where $\mathbf{x} \in \mathbb{R}^{p+q}$, $\boldsymbol{\Lambda} \in \mathbb{R}^{(p+q) \times k}$, and $\boldsymbol{\Psi} \in \mathbb{R}^{(p+q) \times (p+q)}$, and

$$
\mathbf{x} = \boldsymbol{\Lambda} \mathbf{z} + \boldsymbol{\varepsilon}, \qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \boldsymbol{\varepsilon}_a \\ \boldsymbol{\varepsilon}_b \end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}).
$$

Then our PCCA updates are nearly identical to our EM updates for factor analysis. In the E-step, we compute the posterior moments of the latent variable for each stacked sample $\mathbf{x}_i \in \mathbb{R}^{p+q}$, $i = 1, \dots, n$, using $\boldsymbol{\beta} = \boldsymbol{\Lambda}^{\top} (\boldsymbol{\Lambda} \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi})^{-1}$:

$$
\mathbb{E}[\mathbf{z}_i \mid \mathbf{x}_i] = \boldsymbol{\beta} \mathbf{x}_i,
\qquad
\mathbb{E}[\mathbf{z}_i \mathbf{z}_i^{\top} \mid \mathbf{x}_i] = \mathbf{I}_k - \boldsymbol{\beta} \boldsymbol{\Lambda} + \boldsymbol{\beta} \mathbf{x}_i \mathbf{x}_i^{\top} \boldsymbol{\beta}^{\top}.
$$

In the M-step, we update the parameters:

$$
\begin{aligned}
\boldsymbol{\Lambda}^{\star} &= \Big( \sum_{i=1}^{n} \mathbf{x}_i \, \mathbb{E}[\mathbf{z}_i \mid \mathbf{x}_i]^{\top} \Big) \Big( \sum_{i=1}^{n} \mathbb{E}[\mathbf{z}_i \mathbf{z}_i^{\top} \mid \mathbf{x}_i] \Big)^{-1}, \\
\boldsymbol{\Psi}^{\star} &= \frac{1}{n} \operatorname{blockdiag} \Big( \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^{\top} - \boldsymbol{\Lambda}^{\star} \, \mathbb{E}[\mathbf{z}_i \mid \mathbf{x}_i] \, \mathbf{x}_i^{\top} \Big),
\end{aligned}
$$

where $\operatorname{blockdiag}(\cdot)$ keeps only the $p \times p$ and $q \times q$ diagonal blocks. The only difference from standard factor analysis is that $\boldsymbol{\Psi}$ retains this block-diagonal structure rather than being constrained to be diagonal.
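
In code, one iteration of these updates might look like the following rough sketch. It treats the stacked, mean-centered data as a single matrix and keeps $\boldsymbol{\Psi}$ block-diagonal; the variable names are mine, and the repository linked below has a complete implementation.

```python
import numpy as np

def pcca_em_step(X, L, Psi, p):
    """One EM update for PCCA viewed as two-group factor analysis.

    X:   (n, p + q) mean-centered data, the two views stacked column-wise.
    L:   (p + q, k) current loadings.
    Psi: (p + q, p + q) current block-diagonal noise covariance.
    p:   dimensionality of the first view.
    """
    n, k = X.shape[0], L.shape[1]

    # E-step: posterior moments of z_i given x_i.
    beta = L.T @ np.linalg.inv(L @ L.T + Psi)          # (k, p + q)
    Ez = X @ beta.T                                    # rows are E[z_i | x_i]
    sum_Ezz = n * (np.eye(k) - beta @ L) + Ez.T @ Ez   # sum_i E[z_i z_i^T | x_i]

    # M-step: update the loadings and the noise covariance.
    L_new = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
    S = (X.T @ X - L_new @ Ez.T @ X) / n
    Psi_new = np.zeros_like(S)
    Psi_new[:p, :p] = S[:p, :p]    # keep only the two diagonal blocks,
    Psi_new[p:, p:] = S[p:, p:]    # i.e. block-diagonal rather than diagonal
    return L_new, Psi_new
```

The standard factor analysis version of this step would replace the two block assignments with `np.diag(np.diag(S))`.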

Furthermore, these block definitions give us a joint density for $[\mathbf{x}; \mathbf{z}]$ that should look similar to the density in factor analysis:

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{z} \end{bmatrix} \sim \mathcal{N} \left(
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix},
\begin{bmatrix} \boldsymbol{\Lambda} \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi} & \boldsymbol{\Lambda} \\ \boldsymbol{\Lambda}^{\top} & \mathbf{I}_k \end{bmatrix}
\right)
$$

See the Appendix for a derivation.
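
As a quick sanity check, we can also sample from the generative model and compare the empirical covariance of the stacked vector $[\mathbf{x}_a; \mathbf{x}_b; \mathbf{z}]$ against this block matrix. A short sketch, with dimensions and parameters chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, k = 200_000, 3, 2, 2

La = rng.normal(size=(p, k))
Lb = rng.normal(size=(q, k))
Psi_a = 0.3 * np.eye(p)
Psi_b = 0.2 * np.eye(q)

# Sample from the generative model (zero means).
Z = rng.normal(size=(n, k))
Xa = Z @ La.T + rng.multivariate_normal(np.zeros(p), Psi_a, size=n)
Xb = Z @ Lb.T + rng.multivariate_normal(np.zeros(q), Psi_b, size=n)

# Theoretical covariance of [x_a; x_b; z] in block form.
L = np.vstack([La, Lb])
Psi = np.block([[Psi_a, np.zeros((p, q))], [np.zeros((q, p)), Psi_b]])
Sigma = np.block([[L @ L.T + Psi, L], [L.T, np.eye(k)]])

# Empirical covariance of the stacked samples; the difference shrinks with n.
empirical = np.cov(np.hstack([Xa, Xb, Z]), rowvar=False)
print(np.abs(empirical - Sigma).max())
```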

It’s worth thinking about how the properties of CCA are converted to probabilistic assumptions in PCCA. First, in CCA, $\mathbf{z}_a$ and $\mathbf{z}_b$ are a pair of embeddings that we correlate. The assumption is that both datasets have similar low-rank approximations. In PCCA, this property is modeled by having a shared latent variable $\mathbf{z}$.

Furthermore, in CCA, we proved that the canonical variables are orthogonal. In PCCA, there is no such orthogonality constraint. Instead, we assume the latent variables are independent with an isotropic covariance matrix:

$$
\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_k).
$$

This independence assumption is the probabilistic equivalent of orthogonality. The covariance matrix of the latent variables is diagonal, meaning there is no covariance between the $i$-th and $j$-th variables.

The final constraint of the CCA objective is that the vectors have unit length. In probabilistic terms, this is analogous to unit variance, which we have since the identity matrix is an isotropic matrix with each diagonal term equal to $1$.

Code

For an implementation of PCCA, please see my GitHub repository of machine learning algorithms, specifically this file.


Appendix

1. Derivations for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$

Let’s solve for the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ of the density:

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{z} \end{bmatrix} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}).
$$

First, note that $\mathbb{E}[\mathbf{z}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$. Then:

$$
\mathbb{E}[\mathbf{x}] = \mathbb{E}[\boldsymbol{\Lambda} \mathbf{z} + \boldsymbol{\varepsilon}] = \boldsymbol{\Lambda} \, \mathbb{E}[\mathbf{z}] + \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}.
$$

If the data are mean-centered, as we assume, then there is no mean term to add back in, and:

$$
\boldsymbol{\mu} = \begin{bmatrix} \mathbb{E}[\mathbf{x}] \\ \mathbb{E}[\mathbf{z}] \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}.
$$

To understand the covariance matrix $\boldsymbol{\Sigma}$,

$$
\boldsymbol{\Sigma} = \begin{bmatrix}
\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_a) & \operatorname{cov}(\mathbf{x}_a, \mathbf{x}_b) & \operatorname{cov}(\mathbf{x}_a, \mathbf{z}) \\
\operatorname{cov}(\mathbf{x}_b, \mathbf{x}_a) & \operatorname{cov}(\mathbf{x}_b, \mathbf{x}_b) & \operatorname{cov}(\mathbf{x}_b, \mathbf{z}) \\
\operatorname{cov}(\mathbf{z}, \mathbf{x}_a) & \operatorname{cov}(\mathbf{z}, \mathbf{x}_b) & \operatorname{cov}(\mathbf{z}, \mathbf{z})
\end{bmatrix},
$$

let’s consider $\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_a)$ and $\operatorname{cov}(\mathbf{x}_a, \mathbf{z})$. The remaining block matrices, $\operatorname{cov}(\mathbf{x}_b, \mathbf{x}_b)$ and $\operatorname{cov}(\mathbf{x}_b, \mathbf{z})$, have identical proofs, respectively, but with the variables renamed. Both derivations will use the fact that $\operatorname{cov}(\mathbf{u}, \mathbf{v}) = \mathbf{0}$ if $\mathbf{u}$ and $\mathbf{v}$ are independent. First, let’s consider $\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_a)$:

$$
\begin{aligned}
\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_a)
&= \mathbb{E}\big[(\boldsymbol{\Lambda}_a \mathbf{z} + \boldsymbol{\varepsilon}_a)(\boldsymbol{\Lambda}_a \mathbf{z} + \boldsymbol{\varepsilon}_a)^{\top}\big] \\
&= \boldsymbol{\Lambda}_a \mathbb{E}[\mathbf{z}\mathbf{z}^{\top}] \boldsymbol{\Lambda}_a^{\top}
 + \boldsymbol{\Lambda}_a \mathbb{E}[\mathbf{z}\boldsymbol{\varepsilon}_a^{\top}]
 + \mathbb{E}[\boldsymbol{\varepsilon}_a \mathbf{z}^{\top}] \boldsymbol{\Lambda}_a^{\top}
 + \mathbb{E}[\boldsymbol{\varepsilon}_a \boldsymbol{\varepsilon}_a^{\top}] \\
&= \boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_a^{\top} + \boldsymbol{\Psi}_a,
\end{aligned}
$$

where the middle two terms vanish because $\mathbf{z}$ and $\boldsymbol{\varepsilon}_a$ are independent, and $\mathbb{E}[\mathbf{z}\mathbf{z}^{\top}] = \mathbf{I}_k$.

So for the two views, $\mathbf{x}_a$ and $\mathbf{x}_b$, we have:

$$
\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_a) = \boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_a^{\top} + \boldsymbol{\Psi}_a,
\qquad
\operatorname{cov}(\mathbf{x}_b, \mathbf{x}_b) = \boldsymbol{\Lambda}_b \boldsymbol{\Lambda}_b^{\top} + \boldsymbol{\Psi}_b.
$$

Now, let’s consider $\operatorname{cov}(\mathbf{x}_a, \mathbf{z})$:

$$
\operatorname{cov}(\mathbf{x}_a, \mathbf{z})
= \mathbb{E}\big[(\boldsymbol{\Lambda}_a \mathbf{z} + \boldsymbol{\varepsilon}_a)\mathbf{z}^{\top}\big]
= \boldsymbol{\Lambda}_a \mathbb{E}[\mathbf{z}\mathbf{z}^{\top}] + \mathbb{E}[\boldsymbol{\varepsilon}_a \mathbf{z}^{\top}]
= \boldsymbol{\Lambda}_a.
$$

By the same argument (and because $\boldsymbol{\varepsilon}_a$ and $\boldsymbol{\varepsilon}_b$ are independent), $\operatorname{cov}(\mathbf{x}_a, \mathbf{x}_b) = \mathbb{E}\big[(\boldsymbol{\Lambda}_a \mathbf{z} + \boldsymbol{\varepsilon}_a)(\boldsymbol{\Lambda}_b \mathbf{z} + \boldsymbol{\varepsilon}_b)^{\top}\big] = \boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_b^{\top}$. Thus, the full covariance matrix for $[\mathbf{x}_a; \mathbf{x}_b; \mathbf{z}]$ is:

$$
\boldsymbol{\Sigma} =
\begin{bmatrix}
\boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_a^{\top} + \boldsymbol{\Psi}_a & \boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_b^{\top} & \boldsymbol{\Lambda}_a \\
\boldsymbol{\Lambda}_b \boldsymbol{\Lambda}_a^{\top} & \boldsymbol{\Lambda}_b \boldsymbol{\Lambda}_b^{\top} + \boldsymbol{\Psi}_b & \boldsymbol{\Lambda}_b \\
\boldsymbol{\Lambda}_a^{\top} & \boldsymbol{\Lambda}_b^{\top} & \mathbf{I}_k
\end{bmatrix}
=
\begin{bmatrix}
\boldsymbol{\Lambda} \boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi} & \boldsymbol{\Lambda} \\
\boldsymbol{\Lambda}^{\top} & \mathbf{I}_k
\end{bmatrix},
$$

which matches the joint density given above.

  1. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis (Tech. Rep. No. 688). Department of Statistics, University of California, Berkeley.
  2. Klami, A., Virtanen, S., Leppäaho, E., & Kaski, S. (2015). Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 26(9), 2136–2147.