Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression
Megh Shukla, Aziz Shameem, Mathieu Salzmann, Alexandre Alahi
International Conference on Learning Representations (ICLR) 2025, Singapore
Problem Statement
In our previous work, we addressed the challenge of unsupervised covariance estimation through a better parameterization (TIC) of the covariance. However, TIC requires significant computational resources, since computing the Hessian is expensive. While there are multiple ways to mitigate this, such as using just the gradient or smaller models, the inherent challenges remain.
These challenges stem from a key limitation of deep heteroscedastic regression: estimating the sample-dependent covariance in an unsupervised manner. Without labels, the covariance estimator relies solely on patterns in the residuals across samples and may be inaccurate. Can we therefore use self-supervision to improve covariance estimation? Specifically, we ask:
- How to supervise covariance estimation when annotations are available?
- How to generate pseudo-labels when annotations are not available?
Supervision
To gain intuition, we propose studying the simple task of learning a bivariate normal distribution, as shown below. Given samples from the unknown true distribution, how well do different supervision objectives optimize the predicted distribution to match the true distribution?
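For concreteness, here is a minimal sketch of what this toy setup might look like in PyTorch. The specific numbers and the lower-triangular parameterization of the covariance are illustrative assumptions, not the paper's exact configuration:

```python
import torch

# Hypothetical toy setup: samples from a fixed ground-truth bivariate normal,
# and a learnable mean and covariance for the predicted distribution. The
# covariance is parameterized through a lower-triangular factor L so that
# Sigma = L L^T stays positive (semi-)definite.
torch.manual_seed(0)
true_mu = torch.tensor([1.0, -2.0])
true_cov = torch.tensor([[2.0, 0.8], [0.8, 1.0]])
samples = torch.distributions.MultivariateNormal(true_mu, true_cov).sample((1000,))

mu_hat = torch.zeros(2, requires_grad=True)          # learnable mean
L_raw = torch.eye(2).clone().requires_grad_(True)    # learnable covariance factor

def predicted_cov():
    L = torch.tril(L_raw)                            # keep the factor lower-triangular
    return L @ L.T + 1e-6 * torch.eye(2)             # jitter for numerical stability

# Any of the objectives discussed below (NLL, KL with a prior, 2-Wasserstein)
# can now be evaluated against `samples` and minimized with a standard optimizer.
```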
KL Divergence
We first turn to the KL divergence for supervising the covariance: it is a popular measure of the difference between two distributions, and it gives rise to widely used objectives such as the cross entropy and the negative log-likelihood. The KL divergence between two Gaussian distributions $p = \mathcal{N}(\mu_p, \Sigma_p)$ and $q = \mathcal{N}(\mu_q, \Sigma_q)$ in $d$ dimensions is defined as

$$ D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[ \text{Tr}\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_q - \mu_p)^\top \Sigma_q^{-1}(\mu_q - \mu_p) - d + \log\frac{\det \Sigma_q}{\det \Sigma_p} \right] $$
An astute reader may notice that the above formulation is not directly useful for regression: the mean and covariance of the target distribution are unknown. Instead, we only have i.i.d. samples $(x_i, y_i)$ to supervise the covariance. We therefore ask: how can the KL divergence be formulated for deep heteroscedastic regression?
A logical approach would be to replace each label with a distribution. Specifically, for a sample $(x_i, y_i)$ from the dataset, the pseudo-target distribution can be set to $\mathcal{N}(y_i, \Sigma^{(\texttt{prior})}_Y (X))$. However, this approach requires calibrating the KL divergence, since the optimal value for the covariance is not the same as the prior! This can be seen through a simple setting in Lemma 1.
So what does this lemma imply? If we swap out $(x_i, y_i)$ for $(x_i, \mathcal{N}(y_i, \Sigma^{(\texttt{prior})}_Y (X)))$, the predicted covariance converges to twice the true covariance. This motivates calibrating the KL divergence so that the resulting optimal value matches the true covariance.
With this calibrated formulation, $\widehat{\Sigma}_Y(x) \approx \Sigma_Y(x)$. Moreover, the KL divergence can now truly act as a regularizer over the covariance: when the target covariance is unknown and cannot be set as the prior, the calibrated formulation yields the optimal solution $\widehat{\Sigma}_Y(x) \approx \dfrac{\Sigma^{(\texttt{prior})}_Y (X) + \Sigma_Y(x)}{2}$. We can observe this in the graphic below. Unlike the negative log-likelihood, which shows significant fluctuations, the KL divergence yields a stable predicted covariance, because the prior anchors the prediction and prevents disruptions.
However, we observe a residual covariance with both the negative log-likelihood and the KL divergence. Why does this happen? The solution in Lemma 1 is reached only when the mean estimator has converged to the true value, and when the model is exposed to multiple targets $y_i$ for the same $x$. This may not hold in practice because:
- Samples in a batch are i.i.d., and it is unlikely that the same observation appears in a batch with multiple different targets.
- The mean estimator may not have converged. Moreover, convergence is not uniform across the dataset!
In practice, the predicted covariance is heavily dependent on the residual between the prediction and the target: the per-sample optimum replaces the true covariance $\Sigma_Y(x)$ in the expression above with the outer product of the residual, $(y_i - \widehat{\mu}_Y(x))\,(y_i - \widehat{\mu}_Y(x))^\top$.
If this residual is large, it dominates the uncertainty estimate, pushing the model toward learning something that looks more like an error-driven “pseudo-covariance” than the true underlying covariance.
In more detail: the model essentially aligns the covariance with the error direction (the line between the prediction and the true value). This slows down optimization because the residual now affects how the mean is learned. The larger the residual, the more it overpowers prior knowledge about the covariance, making updates less stable. So while the negative log-likelihood fails outright, even the KL divergence is ineffective: it gets overwhelmed by the residuals, leading to unstable updates, especially at higher learning rates.
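For concreteness, here is a minimal sketch of the pseudo-target KL objective described above, written with `torch.distributions`. The KL direction (pseudo-target towards prediction) and the absence of the calibration term are assumptions of this sketch, not necessarily the paper's exact formulation:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def pseudo_target_kl(y, prior_cov, mu_pred, cov_pred):
    """Uncalibrated KL between the pseudo-target N(y_i, Sigma_prior) and the
    prediction N(mu(x_i), Sigma(x_i)), averaged over the batch.
    The KL direction here is an assumption of this sketch."""
    target = MultivariateNormal(y, covariance_matrix=prior_cov)
    pred = MultivariateNormal(mu_pred, covariance_matrix=cov_pred)
    return kl_divergence(target, pred).mean()
```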
Since the KL divergence inherits the same weaknesses as the standard negative log-likelihood (NLL), can we use the 2-Wasserstein distance as a better way to guide covariance learning? Intuitively, the 2-Wasserstein distance does not rely on the residual, making it much more stable. Moreover, it provides a mechanism to directly supervise the covariance.
2-Wasserstein Distance
The 2-Wasserstein distance measures how different two probability distributions are by considering both their mean differences and covariance structures. It provides a more stable way to compare distributions than the KL divergence, as it captures overall shape differences rather than just pointwise mismatches. For two multivariate normal distributions $\mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}_2 = \mathcal{N}(\mu_2, \Sigma_2)$, the (squared) 2-Wasserstein distance is defined as

$$ \mathcal{W}_2(\mathcal{N}_1, \mathcal{N}_2) = \parallel \mu_1 - \mu_2 \parallel^2 + \text{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2}\, \Sigma_1\, \Sigma_2^{1/2} \right)^{1/2} \right) $$
This formulation, however, requires computing the square root of a matrix, which typically involves an eigendecomposition. Unfortunately, eigendecomposition in popular deep learning frameworks can lead to unstable gradients. If the two covariance matrices commute, the distance simplifies to $\mathcal{W}_2(\mathcal{N}_1, \mathcal{N}_2) = \parallel \mu_1 - \mu_2 \parallel^2 + \parallel \Sigma_1^{1/2} - \Sigma_2^{1/2} \parallel ^2_F$. But what does it mean for two covariance matrices to commute? Two matrices commute when $AB=BA$: applying one transformation and then the other gives the same result in either order. Intuitively, the two covariance matrices share the same eigenvectors and transform the data along the same axes; what differs is the degree of transformation, which corresponds to different eigenvalues. However, assuming commutativity just to avoid the eigendecomposition is a very strong assumption: it essentially means the structure of the covariance is known a priori! Is it still possible to avoid the eigendecomposition for non-commutative matrices? With Theorem 1, we show that $\mathcal{W}_2(\mathcal{N}_1, \mathcal{N}_2) \leq \parallel \mu_1 - \mu_2 \parallel^2 + \parallel \Sigma_1^{1/2} - \Sigma_2^{1/2} \parallel ^2_F$ for any pair of covariance matrices!
Theorem 1 is significant from a practical viewpoint. The bound extends the simplification for commutative matrices to the general non-commutative case; by using it, we remove the eigendecomposition and make the optimization more stable. Finally, reducing this bound also reduces the true 2-Wasserstein distance between the two distributions!
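From an implementation standpoint, a minimal sketch of how this bound could serve as a loss is given below. It assumes the network directly outputs a symmetric square root of its covariance (one possible parameterization, not necessarily the paper's), and computes the square root of the target covariance without tracking gradients, so no eigendecomposition enters backpropagation:

```python
import torch

def w2_upper_bound(mu_pred, sqrt_cov_pred, mu_tgt, cov_tgt):
    """Sketch of the upper bound on the 2-Wasserstein distance between two
    Gaussians. Assumes the network directly outputs a symmetric PSD square
    root of its covariance (a hypothetical parameterization).
    Shapes: (N, D) means, (N, D, D) covariance terms; returns (N,)."""
    # Mean term: squared Euclidean distance between the means.
    mean_term = ((mu_pred - mu_tgt) ** 2).sum(dim=-1)

    # Symmetric PSD square root of the target covariance. The target is a
    # fixed (pseudo-)label, so we can decompose it without tracking gradients.
    with torch.no_grad():
        evals, evecs = torch.linalg.eigh(cov_tgt)
        sqrt_cov_tgt = (
            evecs @ torch.diag_embed(evals.clamp(min=0).sqrt()) @ evecs.transpose(-1, -2)
        )

    # Covariance term: squared Frobenius norm between the square roots.
    cov_term = ((sqrt_cov_pred - sqrt_cov_tgt) ** 2).sum(dim=(-2, -1))

    return mean_term + cov_term
```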
So how does the 2-Wasserstein compare with the KL Divergence? We study this through our toy example:
We make two key observations. First, at higher learning rates, the negative log-likelihood and the KL divergence suffer from unstable optimization: samples that are not aligned with the predicted covariance tend to destabilize it, since they are deemed very far from the current predicted distribution. In contrast, the 2-Wasserstein distance converges smoothly, since it does not depend on the residuals and benefits from direct supervision of the covariance. Second, if the prior covariance is reasonably close to the true covariance, we also obtain better likelihood values!
Towards Self-Supervision
While we have studied objectives for supervision, we still need to find labels for the covariance! Such labels can often come from good prior knowledge about the task being solved. However, without such a prior, we need to find signals for the covariance within the data. We do this by looking at nearby examples and using the covariance of their targets as a proxy for the unknown true covariance. By doing so, we capture two key intuitions:
- The target has a high (co-)variance if it exhibits large variations in a small neighborhood of the input.
- The closer another input is to our input, the likelier it is that the corresponding target is a potential label for our input.
We describe our algorithm below.
Here’s how it works:
- Find Similar Data Points (Neighborhood Selection): For each input $x$, we find other similar inputs in the dataset. Similarity is measured using the Mahalanobis distance, which accounts for both distance and spread (instead of just the Euclidean distance). We then interpret these distances probabilistically through a softmax: neighbors with smaller distances are far more likely to be treated as neighbors than samples with larger distances.
- Compute the Variation in Their Outputs: If an input has multiple similar neighbors with very different outputs, this suggests high variability; if all nearby points produce similar outputs, the variance is low. Moreover, different neighbors contribute to different degrees: we use the probabilistic interpretation of the neighborhood to compute the 'expected' covariance for our input sample (see the sketch after this list).
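A minimal sketch of this pseudo-labeling step is given below. The global Mahalanobis metric estimated from the batch, the `temperature` knob, and the exclusion of each sample from its own neighborhood are assumptions of this sketch rather than the paper's exact choices:

```python
import torch

def covariance_pseudo_labels(x, y, temperature=1.0):
    """Neighborhood-based pseudo-labels for the covariance (sketch).
    x: (N, D_in) inputs, y: (N, D_out) targets. Returns (N, D_out, D_out)."""
    # Mahalanobis metric: (pseudo-)inverse of the input covariance over the batch.
    x_centered = x - x.mean(dim=0, keepdim=True)
    cov_x = x_centered.T @ x_centered / (x.shape[0] - 1)
    precision = torch.linalg.pinv(cov_x)

    # Pairwise squared Mahalanobis distances between inputs.
    diff = x.unsqueeze(1) - x.unsqueeze(0)                        # (N, N, D_in)
    d2 = torch.einsum('ijd,de,ije->ij', diff, precision, diff)    # (N, N)

    # Probabilistic neighborhoods: closer samples get higher weight.
    # Excluding each sample from its own neighborhood is an assumption here.
    d2.fill_diagonal_(float('inf'))
    w = torch.softmax(-d2 / temperature, dim=1)                   # rows sum to 1

    # Expected covariance of the targets under the neighbor distribution.
    mu = w @ y                                                    # (N, D_out)
    y_diff = y.unsqueeze(0) - mu.unsqueeze(1)                     # (N, N, D_out)
    pseudo_cov = torch.einsum('ij,ijd,ije->ide', w, y_diff, y_diff)
    return pseudo_cov
```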
Why does this work? The network spends a significant amount of time trying to identify patterns from just the residuals. This may not even converge to a reasonable value! By explicitly encoding patterns within the dataset, we provide a much stronger signal to supervise the covariance.
We now have all the ingredients needed for self-supervision! For each sample, we obtain a pseudo-label for the covariance and train the covariance estimator against it using the 2-Wasserstein distance upper bound.
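Putting the pieces together, a hypothetical training step could look as follows, reusing the `covariance_pseudo_labels` and `w2_upper_bound` sketches above and assuming `model` outputs a mean and a covariance square root:

```python
# Hypothetical training loop; `model`, `optimizer`, and `loader` are assumed.
for x_batch, y_batch in loader:
    with torch.no_grad():
        pseudo_cov = covariance_pseudo_labels(x_batch, y_batch)  # covariance pseudo-labels
    mu_pred, sqrt_cov_pred = model(x_batch)                      # assumed two-headed model
    loss = w2_upper_bound(mu_pred, sqrt_cov_pred, y_batch, pseudo_cov).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```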
Experiments
Does self-supervision truly retain accuracy while being significantly cheaper computationally? We study this through a setup similar to TIC-TAC, conducting experiments across real and synthetic datasets that span univariate and multivariate analysis.
Univariate

Multivariate


UCI Regression

2D Human Pose Estimation

Conclusion
Our study is best concluded by the following points:
🔹 Traditional methods struggle to estimate covariance accurately without supervision.
🔹 We show how the KL divergence can be calibrated for regularization, but note that it remains sensitive to residuals.
🔹 The 2-Wasserstein bound provides a stable way to optimize covariance estimation, avoiding the pitfalls of KL divergence and NLL.
🔹 Our simple neighborhood-based heuristic generates effective pseudo-labels, enabling self-supervised learning.
🔹 The result? A computationally efficient approach that improves both accuracy and convergence in deep heteroscedastic regression!
Acknowledgement
We thank the reviewers for their valuable comments and insights. We also thank Reyhaneh Hosseininejad for her help in preparing the paper.
This research is funded by the Swiss National Science Foundation (SNSF) through the project Narratives from the Long Tail: Transforming Access to Audiovisual Archives (Grant: CRSII5 198632). The project description is available at: https://www.futurecinema.live/project/
Citation
If our work is useful, please consider citing the accompanying paper and starring our code on GitHub!
@inproceedings{
shukla2025towards,
title={Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression},
author={Megh Shukla and Aziz Shameem and Mathieu Salzmann and Alexandre Alahi},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=Q1kPHLUbhi}
}