Over the past months, I have given several talks about Monte Carlo conformal prediction and the problem of calibrating with uncertain ground truth, for example, stemming from annotator disagreement. Each time, the audience had great questions and ideas for extensions and interesting applications. In this article, I want to provide a sort of FAQ for our work.
Are code and data available?
Yes, code and data are on GitHub. The code includes both Monte Carlo conformal prediction and the plausibility regions from v1 of the paper.
Can you derive the conformal $p$-values used in the paper?
The connection between conformal prediction and $p$-values is scattered across the literature and there is, to the best of my knowledge, no single good reference for it. So we added a thorough derivation in Appendix B of the paper.
How do you get the plausibilities $\lambda$ in practice from different formats of annotations?
In a nutshell, this is a modeling choice and depends on the annotations you have access to. In Section 3.1, we give two examples for annotators providing single labels or partial rankings of labels to define the corresponding aggregated distribution $\mathbb{P}_{\text{agg}}^{Y|X}$ through an aggregation model
$\mathbb{P}_{\text{agg}}(Y=y|X=x) = \int \int p(y|\lambda)\, p(\lambda|b,x)\, p(b|x)\, db\, d\lambda$
where $\lambda$ are the plausibilities (for classification, $\lambda$ is simply a vector such that $p(y|\lambda) = \lambda_y$ defines the probability of class $y$), $y$ is the target label, $x$ the example, and $b$ an annotation. Essentially, obtaining plausibilities in this model boils down to defining $p(\lambda|b,x)$. In the paper, we simplified this to $p(\lambda|b)$. So we assume that annotators draw annotations from a model $p(b|x)$ that is generally unknown and that we do not need to care about, and then we obtain plausibilities from $p(\lambda|b)$. As mentioned above, if $b$ is a single label per example/annotator or a partial ranking of labels, we have example models in Section 3.1. We have another example in this paper in Appendix A. This assumes a binary classification problem where annotations are given on some scale, e.g., a Likert scale. We put a Gaussian distribution over this scale and define the positive and negative class using a threshold (which can be fitted). Often, $p(\lambda|b)$ essentially translates between the format of your annotations and the format of your plausibilities.
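To make the single-label case concrete, here is a minimal sketch of such a translation, assuming one integer label per annotator and a simple Dirichlet model over categorical plausibilities; the function name and prior are illustrative, not the exact model from the paper.

```python
import numpy as np

def sample_plausibilities(annotations, num_classes, prior=1.0, rng=None):
    """Toy aggregation model p(lambda | b) for single-label annotations.

    annotations: integer labels given by the annotators for ONE example.
    Returns one sampled plausibility vector lambda (a categorical distribution).
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(annotations, minlength=num_classes)
    # Dirichlet posterior over categorical plausibilities given the votes:
    # its mean is a point estimate, sampling from it reflects annotation uncertainty.
    return rng.dirichlet(prior + counts)

# Example: five annotators, three classes.
lam = sample_plausibilities([0, 0, 1, 0, 2], num_classes=3)
```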
Why not perform conformal prediction directly in the plausibility space?
We tried this and reported on it in v1 of the paper. We also have a follow-up technical report, because conformal prediction in the plausibility space essentially gives you a conformal way to construct credal sets: Conformalized Credal Set Predictors
Can’t we directly calibrate against annotations?
In many settings you can and should do this. But there is a hidden assumption: it assumes that the format of your annotations matches the format of your plausibilities. For classification, this means that your plausibilities are categorical distributions and your annotations are single labels (each annotator gives you a single label). If there is a mismatch, however, you need some model $p(\lambda|b)$ that essentially translates between the two. An example we give in the paper is using partial rankings of labels for a classification task, see Section 3.1.
Isn’t there also uncertainty in the annotations themselves, like annotation noise?
Yes, there definitely is. In our paper, however, annotation uncertainty is “hidden” within the aggregation model $p(\lambda|b)$ used to obtain the plausibilities. We have a separate paper on exactly this difference, where we define ground truth uncertainty as decomposing into inherent uncertainty and annotation uncertainty. Monte Carlo conformal prediction deals only with inherent uncertainty, i.e., the fact that the plausibilities can have high entropy (there is no single crisp label). In this work, we essentially show that we don’t just want a point estimate of $p(\lambda|b)$, but we want to be able to sample from it directly. The distribution of plausibilities $\lambda \sim p(\lambda|b)$ describes the annotation uncertainty.
You can use Monte Carlo conformal prediction to tackle both sources of uncertainty or only one of them. Instead of only sampling labels from $\lambda$ for each example, you can also re-sample the plausibilities $\lambda \sim p(\lambda|b)$ and thereby take into account both inherent and annotation uncertainty. You can also consider annotation uncertainty only. This essentially pretends that there is no inherent uncertainty, but the plausibilities $\lambda$ are still not one-hot due to disagreement and annotation uncertainty. So you can sample $\lambda \sim p(\lambda|b)$ and then take the top-1 label of $\lambda$ for conformal prediction.
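As a rough illustration of the sampling described above, here is a minimal split-conformal sketch, assuming softmax probabilities and one plausibility vector per calibration example. It only shows the mechanics of augmenting the calibration set by sampling labels; it glosses over the $p$-value aggregation the paper uses for the formal guarantee, and all names are illustrative.

```python
import numpy as np

def mc_calibrate(cal_probs, cal_plaus, alpha=0.1, m=10, rng=None):
    """Calibrate on an augmented set: sample m labels per example from lambda.

    cal_probs: (N, K) model softmax probabilities on the calibration examples.
    cal_plaus: (N, K) plausibility vectors lambda (rows sum to one).
    Returns a threshold on the conformity score 1 - p(sampled label).
    """
    rng = np.random.default_rng() if rng is None else rng
    N, K = cal_probs.shape
    scores = []
    for i in range(N):
        labels = rng.choice(K, size=m, p=cal_plaus[i])
        scores.extend(1.0 - cal_probs[i, labels])
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    # Standard split-conformal quantile with finite-sample correction.
    k = min(n - 1, int(np.ceil((n + 1) * (1 - alpha))) - 1)
    return scores[k]

def prediction_set(test_probs, threshold):
    """All labels whose conformity score stays below the calibrated threshold."""
    return np.where(1.0 - test_probs <= threshold)[0]
```

To also account for annotation uncertainty, you would draw a fresh $\lambda \sim p(\lambda|b)$ (e.g., with the Dirichlet sketch above) before each of the $m$ label draws instead of reusing a fixed plausibility vector per example.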
Can we guarantee arbitrary risks?
We did not explicitly explore this. I am fairly confident that this adaptation should work with Monte Carlo conformal prediction (i.e., creating an augmented calibration set by sampling labels and then applying conformal risk control). However, whether the guarantee is preserved is as yet unclear because the $p$-value trick we use in the paper does not apply directly. Let me know if you want to work on this!
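Purely to make the idea concrete, here is what the mechanics of that combination could look like for a miscoverage-style loss: sample labels into an augmented calibration set and then run a standard conformal risk control scan over a threshold grid. Whether the resulting guarantee holds is exactly the open question above, and everything here (names, loss, grid) is an assumption, not the paper's method.

```python
import numpy as np

def crc_on_augmented_set(cal_probs, sampled_labels, alpha=0.1, grid=None):
    """Conformal risk control scan on a Monte-Carlo-augmented calibration set.

    Loss (bounded by B = 1): miscoverage of the sampled label under the
    prediction set {k : p_k >= t}.
    cal_probs: (n, K) softmax probabilities, one row per augmented example.
    sampled_labels: (n,) labels sampled from the plausibilities.
    """
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    n = len(sampled_labels)
    label_probs = cal_probs[np.arange(n), sampled_labels]
    # Scan from small sets (large t) to large sets (small t) and return the
    # largest threshold whose corrected empirical risk is still below alpha.
    for t in sorted(grid, reverse=True):
        risk = np.mean(label_probs < t)
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return t
    return 0.0  # fall back to predicting every label
```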
Why do we still get coverage $1 - \alpha$ empirically but can only guarantee $1 - 2\alpha$ theoretically?
This is still an unsolved problem. There are many similar approaches, such as Jackknife+ conformal prediction or cross-conformal prediction, that suffer from the same problem or make the same observation. I feel it all boils down to the limitation that combining dependent $p$-values cannot be done while preserving the $1 - \alpha$ guarantee. On the other hand, it is difficult (and may never happen in practice) to construct datasets that actually materialize the lower guarantee of $1 - 2\alpha$. We actually tried to construct cases of Monte Carlo conformal prediction where there is an explicit empirical coverage gap, but we couldn’t.
Can we extend Monte Carlo conformal prediction to regression problems?
Yes, this should be possible, but it requires some additional modeling assumptions. For classification, a categorical distribution is fairly general. For regression, however, you have to make an assumption about what the plausibilities are. For example, you could decide to pick normal distributions and define the aggregation model $p(\lambda|b)$ accordingly. However, depending on the annotations and the task, there might be a variety of different distributions that appropriately model the targets $y$. This makes the regression case a bit more complex than the classification one.
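For illustration only, here is a tiny sketch of what Gaussian plausibilities could look like when each annotator provides a scalar value; this particular choice is an assumption, not something we evaluated in the paper.

```python
import numpy as np

def gaussian_plausibility(annotations):
    """Fit a Gaussian plausibility to the scalar annotations of one example."""
    b = np.asarray(annotations, dtype=float)
    return b.mean(), b.std(ddof=1) + 1e-6  # small floor avoids a degenerate sigma

def sample_targets(mu, sigma, m=10, rng=None):
    """Sample m pseudo-targets y ~ N(mu, sigma^2) for the augmented calibration set."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(mu, sigma, size=m)
```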
Can you give an intuition for the coverage definition in multi-label classification experiments?
For context, multi-label conformal prediction has long been tackled using a very similar approach to Monte Carlo conformal prediction: given a label set for each example in the calibration set, repeat the example for each label and perform standard conformal prediction. If, instead of using each label once, we pretend that the plausibilities define a uniform distribution over all labels in the label set, we can perform Monte Carlo conformal prediction. Then, aggregated coverage means that the calibration procedure is allowed to decide how to distribute coverage across the label sets. Essentially, this means that we do not require the predictor to output all labels in the label set to obtain coverage, but it can obtain “partial” coverage. Whether this is desirable depends on the application.
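Concretely, the conversion from a label set to plausibilities is as simple as the following illustrative sketch; the resulting vector can then be used with the Monte Carlo calibration sketch above.

```python
import numpy as np

def label_set_to_plausibilities(label_set, num_classes):
    """Treat a multi-label ground-truth set as a uniform distribution over its labels."""
    lam = np.zeros(num_classes)
    lam[list(label_set)] = 1.0 / len(label_set)
    return lam

# Example: the label set {1, 4} over six classes becomes [0, 0.5, 0, 0, 0.5, 0].
lam = label_set_to_plausibilities({1, 4}, num_classes=6)
```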
What do the data augmentation experiments mean in practice?
For training machine learning models, data augmentation is a common practice. Monte Carlo conformal prediction essentially tells us that we can do the same for calibration. This closes an important gap: data augmentation during training is meant to introduce invariances into the prediction model. Standard conformal prediction, however, would ignore these invariances and thus lead to under-coverage on parts of the distribution. With Monte Carlo conformal prediction, we can just calibrate on an augmented calibration set.
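As a sketch, assuming you have a list of augmentation functions and a model that returns softmax probabilities (both names below are placeholders), calibrating on the augmented set just means scoring every augmented copy alongside the original and thresholding the scores as in standard split conformal prediction:

```python
import numpy as np

def augmented_calibration_scores(predict_probs, cal_inputs, cal_labels, augmentations):
    """Conformity scores on an augmented calibration set.

    predict_probs: function mapping one input to (K,) softmax probabilities (assumed).
    augmentations: list of functions input -> augmented input (e.g., flips, crops).
    """
    scores = []
    for x, y in zip(cal_inputs, cal_labels):
        for transform in [lambda inp: inp] + list(augmentations):
            probs = predict_probs(transform(x))
            scores.append(1.0 - probs[y])
    return np.asarray(scores)
```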
Can’t we perform conformal prediction directly on differential diagnoses/partially ranked lists for the dermatology case study?
Yes, we could. It would involve assigning a conformity score to each partial ranking. This can be done using a Plackett-Luce model as described in this paper. But we would still need to try all possible partial rankings at test time to construct the actual prediction set. This is even more expensive than doing standard conformal prediction for multi-label classification. Also, this is often not desired in practice because it is unclear how to use such sets of partial rankings in many applications.