Calibrated CLIP Probabilities using Conformal Prediction

Authors and Affiliations

  • Fadel M. Megahed, Farmer School of Business, Miami University
  • Ying-Ju (Tessa) Chen, Department of Mathematics, University of Dayton
  • Bianca Maria Colosimo, Dipartimento di Meccanica, Politecnico di Milano
  • Marco Luigi Giuseppe Grasso, Dipartimento di Meccanica, Politecnico di Milano
  • L. Allison Jones-Farmer, Farmer School of Business, Miami University

Published

November 3, 2025

1 Overview

This file shows how we:

  1. run few-shot classification with CLIP,
  2. add conformal prediction for set-valued outputs, and
  3. compute both standard and conformal metrics.

It is organized to be easy to follow and to mirror our earlier ad‑hoc functions in fmegahed/qe_genai, now refactored into modules.

1.1 Modules

I refactored our previous code into modules to make it easier to read and to keep this file focused on its outputs. The modules are described below:

  • image_io.py:
    • load_image — robust local/URL image loader with timeout handling.
    • sample_urls — helper for random sampling and returning both chosen and remaining file lists.
  • wrappers.py:
    • encode_and_normalize — encodes images with CLIP and L2-normalizes embeddings.
    • CLIPWrapper — lightweight sklearn-compatible wrapper exposing predict_proba and predict for few-shot inference.
  • zero_shot.py:
    • evaluate_zero_shot_predictions — evaluates zero-shot CLIP results, reports metrics, and optionally saves confusion matrices.
  • conformal.py:
    • few_shot_fault_classification_conformal — main routine for few-shot + conformal prediction, including
      • Global and Mondrian (class-conditional) calibration,
      • Optional probability calibration (isotonic or sigmoid), and
      • An allow_empty flag to control abstention behavior.
  • metrics.py:
    • compute_classification_metrics — computes standard classification metrics from experiment CSVs.
    • compute_conformal_set_metrics — computes coverage and efficiency metrics for conformal sets.
    • make_true_labels_from_counts — reconstructs ground-truth label vectors from class counts.
  • viz.py:
    • Optional visualization helpers for confusion matrices.

1.2 Changes from Prior Work

Compared to our earlier implementation in fmegahed/qe_genai, we have three major updates:

  1. wrappers.py — CLIP is now wrapped as an sklearn-style classifier (CLIPWrapper), exposing predict_proba and predict.
    This abstraction makes the model compatible with downstream calibration and conformal logic.

  2. few_shot_fault_classification_conformal
    The prior few_shot_fault_classification has been replaced with a version that:

    • Implements Global and Mondrian conformal prediction directly.
    • Allows optional probability calibration before conformal scoring; in this demo we use isotonic regression.
    • Provides an allow_empty flag; with allow_empty=False, abstentions are suppressed and at least one label is always returned.
  3. Metrics expansion

    • compute_conformal_set_metrics adds conformal-specific diagnostics such as coverage and class-conditional coverage.

1.3 Probability Calibration

Before conformal prediction, we optionally calibrate the model probabilities to improve reliability.
Given uncalibrated positive-class probabilities \(p_{\text{uncal}}\), we learn a monotone mapping \(g\) on a calibration split and apply:

\[p_{\text{cal}} = g(p_{\text{uncal}}), \quad \hat{p}(y \mid x) = \begin{bmatrix} 1 - p_{\text{cal}} \\ p_{\text{cal}} \end{bmatrix}.\]

1.3.1 Calibration Options

1.3.1.1 Isotonic Regression

  • Uses sklearn.isotonic.IsotonicRegression(out_of_bounds="clip").
  • Learns a piecewise-monotone calibration curve \(g\) from pairs \((p_{\text{uncal}}, y)\).
  • Overall, per the sklearn manual, isotonic regression performs as well as or better than sigmoid calibration when there is enough data (more than 1,000 samples) to avoid overfitting.

1.3.1.2 Sigmoid (Platt-Style) Calibration

  • Uses sklearn.linear_model.LogisticRegression on logit-transformed probabilities:
    \[\operatorname{logit}(p) = \log\!\left(\frac{p}{1 - p}\right).\]
    The calibrated probability is: \[p_{\text{cal}} = \sigma(a \cdot \operatorname{logit}(p_{\text{uncal}}) + b),\] where \(\sigma\) is the logistic function and \(a, b\) are learned from calibration data.
  • Per the sklearn manual, this approach is recommended for small calibration sets, though not necessarily for imbalanced data.

Both calibrations are implemented internally via a simple adapter that applies transform_fn to the positive-class probability and reconstructs [1 - p, p].
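A minimal sketch of this adapter for the isotonic case (function names such as fit_isotonic_transform and apply_calibration are illustrative, not the package's actual API):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic_transform(p_uncal, y):
    """Learn a monotone mapping g on a calibration split."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(p_uncal, y)
    return iso.transform  # the transform_fn applied to the positive class

def apply_calibration(transform_fn, p_uncal):
    """Calibrate the positive-class probability, then rebuild [1 - p, p]."""
    p_cal = np.clip(transform_fn(p_uncal), 0.0, 1.0)
    return np.column_stack([1.0 - p_cal, p_cal])

# Toy calibration split: uncalibrated positive-class probabilities and labels
p = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.70])
y = np.array([0, 0, 1, 1, 1, 0])
proba = apply_calibration(fit_isotonic_transform(p, y), p)
```

Swapping in the sigmoid option would replace the isotonic fit with a logistic regression on logit-transformed probabilities, leaving the [1 - p, p] reconstruction unchanged.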


2 Conformal Prediction Background

Conformal prediction converts point predictions into prediction sets with guaranteed coverage probability \(1 - \alpha\). For classification, the target is:

\[\Pr(Y_{n+1} \in \hat{C}_{n+1}) \ge 1 - \alpha.\]

The guarantee is finite-sample valid, making CP suitable for few-shot or safety-critical tasks.

2.1 Nonconformity Scores

Each calibration sample \((x_i, y_i)\) yields a score:

\[s_i = 1 - p_{\text{cal}}(y_i \mid x_i),\]

where smaller \(s_i\) implies higher model confidence in the correct label.

2.2 Finite-Sample Quantile

The conformal threshold is the conservative quantile:

\[q_{1-\alpha} = \text{Quantile}_{\lceil (1-\alpha)(n+1) \rceil / n}(s).\]
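A minimal sketch of this rule (the package's _finite_sample_quantile may differ in details):

```python
import numpy as np

def finite_sample_quantile(scores, alpha):
    """Conservative quantile at level ceil((1 - alpha)(n + 1)) / n, capped at 1."""
    n = len(scores)
    level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Toy nonconformity scores s_i = 1 - p_cal(y_i | x_i)
scores = np.array([0.02, 0.05, 0.10, 0.20, 0.40])
q = finite_sample_quantile(scores, alpha=0.2)  # level = ceil(0.8 * 6)/5 = 1.0
```

The `method="higher"` argument keeps the estimate conservative by never interpolating below an observed score.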

2.3 Prediction Set Construction

For a new input \(x^*\), include class \(y\) in the prediction set if:

\[p_{\text{cal}}(y \mid x^*) \ge 1 - q_{1-\alpha}.\]

Thus the conformal set is:

\[\hat{C}(x^*) = \{ y : p_{\text{cal}}(y \mid x^*) \ge 1 - q_{1-\alpha} \}.\]
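A minimal sketch of this set rule, assuming the calibrated probabilities and the threshold \(q_{1-\alpha}\) are already in hand (variable names are illustrative):

```python
import numpy as np

def predict_set_global(p_cal_row, q, class_labels, allow_empty=True):
    """Keep every label whose calibrated probability clears 1 - q."""
    keep = p_cal_row >= 1.0 - q
    if not allow_empty and not keep.any():
        keep[np.argmax(p_cal_row)] = True  # fall back to the argmax class
    return [c for c, k in zip(class_labels, keep) if k]

labels = ("Nominal", "Defective")
# A confident example: only "Nominal" clears the 1 - q = 0.9 bar
confident = predict_set_global(np.array([0.97, 0.03]), 0.10, labels)
# An ambiguous example with allow_empty=False: the empty set is replaced by argmax
ambiguous = predict_set_global(np.array([0.55, 0.45]), 0.10, labels,
                               allow_empty=False)
```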

2.4 Mondrian (Class-Conditional) Conformal Prediction

To ensure per-class coverage, Mondrian conformal prediction computes thresholds separately for each class:

\[q_y = \text{Quantile}_{\lceil (1-\alpha)(n_y+1) \rceil / n_y}\big(\{\,1 - p_{\text{cal}}(y_i \mid x_i) : y_i = y\,\}\big),\]

where \(n_y\) is the number of calibration samples with label \(y\).

The resulting guarantee is:

\[\Pr(Y_{n+1} \in \hat{C}(x_{n+1}) \mid Y_{n+1} = y) \ge 1 - \alpha.\]
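A sketch of the per-class fit, reusing the finite-sample quantile rule within each class's calibration scores (illustrative names; the package's _fit_mondrian_thresholds may differ):

```python
import numpy as np

def fit_mondrian_thresholds(p_true, y, class_labels, alpha):
    """Per-class thresholds q_y from scores s = 1 - p_true, split by label."""
    thresholds = {}
    for c in class_labels:
        s = 1.0 - p_true[y == c]  # scores restricted to class c
        n = len(s)
        level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
        thresholds[c] = np.quantile(s, level, method="higher")
    return thresholds

rng = np.random.default_rng(0)
# Toy calibration split: model more confident on Nominal than Defective
p_true = np.concatenate([rng.uniform(0.85, 1.0, 50), rng.uniform(0.60, 1.0, 50)])
y = np.array(["Nominal"] * 50 + ["Defective"] * 50)
q_y = fit_mondrian_thresholds(p_true, y, ("Nominal", "Defective"), alpha=0.1)
```

Each class's threshold reflects how confident the model is on that class, which is what restores per-class coverage.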

2.5 Implementation Mapping

Concept-to-implementation mapping:

  • Score definition and quantile: _finite_sample_quantile in conformal.py implements the finite-sample rule above.
  • Class-specific thresholds: _fit_mondrian_thresholds computes per-class \(q_y\) from calibration scores \(s = 1 - p_{\text{true}}\).
  • Set construction: _predict_sets_mondrian and _predict_sets_global include a label \(y\) if \(p_y(x^*) \ge 1 - q_y\). When allow_empty=False, the argmax class is inserted when the set would otherwise be empty.
  • Probabilities \(p_y(x)\): CLIPWrapper.predict_proba converts cosine-similarity logits to probabilities using a softmax function with a temperature parameter.
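The last step can be sketched as a temperature-scaled softmax over cosine similarities (CLIPWrapper.predict_proba is described as doing this; the exact implementation may differ):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax; smaller temperatures sharpen the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy cosine similarities to the Nominal and Defective reference embeddings
sims = np.array([0.31, 0.24])
p = softmax_with_temperature(sims, temperature=0.05)  # much sharper than T = 1
```

Because CLIP's cosine similarities live in a narrow range, the temperature controls how peaked the resulting probabilities are, which in turn affects calibration downstream.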

3 Imports

In the code chunk below, I import our functions from the internal package. I also use the torchinfo package to display details of the loaded CLIP model, which allowed me to observe the following:

Comment: In our previous paper, we did not load the 336px version of the ViT-L/14 model. If we had wanted to do so, we should have explicitly loaded the model as: ViT-L/14@336px.

import sys, os

sys.path.insert(0, os.path.abspath("."))

## Import our functions/package
from conformal_clip import (
    get_image_urls, load_image,
    evaluate_zero_shot_predictions,
    CLIPWrapper, encode_and_normalize,
    few_shot_fault_classification_conformal,
    compute_classification_metrics, compute_conformal_set_metrics,
    make_true_labels_from_counts
)
print("Package imported.")


## Load CLIP model
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from torchinfo import summary

device = "cuda" if torch.cuda.is_available() else "cpu"

model, preprocess = clip.load("ViT-L/14", device=device)

vision_sum = summary(
  model.visual, verbose=0,
  input_size=(1, 3, 224, 224), # batch, channels, H, W
)
print("-----------------------------------------------")
print("Comment: In our previous paper, we did not load the 336px version of the\nViT-L/14 model. If we had wanted to do so, we should have explicitly loaded\nthe model as: ViT-L/14@336px.")
print(vision_sum)
Package imported.
-----------------------------------------------
Comment: In our previous paper, we did not load the 336px version of the
ViT-L/14 model. If we had wanted to do so, we should have explicitly loaded
the model as: ViT-L/14@336px.
===============================================================================================
Layer (type:depth-idx)                        Output Shape              Param #
===============================================================================================
VisionTransformer                             [1, 768]                  1,050,624
├─Conv2d: 1-1                                 [1, 1024, 16, 16]         602,112
├─LayerNorm: 1-2                              [1, 257, 1024]            2,048
├─Transformer: 1-3                            [257, 1, 1024]            --
│    └─Sequential: 2-1                        [257, 1, 1024]            --
│    │    └─ResidualAttentionBlock: 3-1       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-2       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-3       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-4       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-5       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-6       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-7       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-8       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-9       [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-10      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-11      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-12      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-13      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-14      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-15      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-16      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-17      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-18      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-19      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-20      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-21      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-22      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-23      [257, 1, 1024]            12,596,224
│    │    └─ResidualAttentionBlock: 3-24      [257, 1, 1024]            12,596,224
├─LayerNorm: 1-4                              [1, 1024]                 2,048
===============================================================================================
Total params: 303,966,208
Trainable params: 303,966,208
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 51.95
===============================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 357.91
Params size (MB): 808.62
Estimated Total Size (MB): 1167.13
===============================================================================================

4 The “Stochastic Textured Surfaces” Experiment

4.1 Loading the Images

# --- Load and split local images --------------------
import os, glob
import numpy as np
from conformal_clip.image_io import load_image, sample_urls

# Base folder
base_dir = os.path.join("data", "textile_images", "simulated")
nominal_dir = os.path.join(base_dir, "nominal")
local_dir   = os.path.join(base_dir, "local")
global_dir  = os.path.join(base_dir, "global")

# Gather file paths for common image extensions (no new helper function)
exts = ["jpg", "jpeg", "png"]
nominal_paths = sum([glob.glob(os.path.join(nominal_dir, f"*.{e}")) for e in exts], [])
local_paths   = sum([glob.glob(os.path.join(local_dir,   f"*.{e}")) for e in exts], [])
global_paths  = sum([glob.glob(os.path.join(global_dir,  f"*.{e}")) for e in exts], [])

# Reproducible RNG
rng = np.random.default_rng(2024)

# --------------------------
# (1) Test set: 200 images
#     100 Nominal, 50 Global, 50 Local
# --------------------------
test_size = 100
test_nominal_paths, remaining_nominal_paths = sample_urls(nominal_paths, test_size, rng)
test_global_paths,  remaining_global_paths  = sample_urls(global_paths,  test_size // 2, rng)
test_local_paths,   remaining_local_paths   = sample_urls(local_paths,   test_size // 2, rng)

test_defective_paths = test_global_paths + test_local_paths
test_paths = test_nominal_paths + test_defective_paths
test_image_filenames = [os.path.basename(p) for p in test_paths]

# ------------------------------------------------
# (2) Training (few-shot): 50 per class total
#     Defective = 25 global + 25 local
# ------------------------------------------------
train_nominal_n = 50
train_def_global_n = 25
train_def_local_n  = 25

learn_nominal_paths, remaining_nominal_paths = sample_urls(remaining_nominal_paths, train_nominal_n, rng)
learn_global_paths,  remaining_global_paths  = sample_urls(remaining_global_paths,  train_def_global_n, rng)
learn_local_paths,   remaining_local_paths   = sample_urls(remaining_local_paths,   train_def_local_n,  rng)
learn_defective_paths = learn_global_paths + learn_local_paths

# ------------------------------------------------
# (3) Calibration: same sizes as training
# ------------------------------------------------
cal_nominal_paths, remaining_nominal_paths = sample_urls(remaining_nominal_paths, train_nominal_n, rng)
cal_global_paths,  remaining_global_paths  = sample_urls(remaining_global_paths,  train_def_global_n, rng)
cal_local_paths,   remaining_local_paths   = sample_urls(remaining_local_paths,   train_def_local_n,  rng)
cal_defective_paths = cal_global_paths + cal_local_paths

# ------------------------------------------------
# (4) Descriptions for few-shot references
# ------------------------------------------------
learn_nominal_fns = [os.path.basename(p) for p in learn_nominal_paths]
learn_global_fns  = [os.path.basename(p) for p in learn_global_paths]
learn_local_fns   = [os.path.basename(p) for p in learn_local_paths]

nominal_descriptions = [
    f"Image {fn}: nominal textile, consistent weave, no visible defects."
    for fn in learn_nominal_fns
]
global_descriptions = [
    f"Image {fn}: global distortion, uniform shift across texture."
    for fn in learn_global_fns
]
local_descriptions = [
    f"Image {fn}: localized defect disrupting weave pattern."
    for fn in learn_local_fns
]
defective_descriptions = global_descriptions + local_descriptions

# ------------------------------------------------
# (5) Load and preprocess images
# ------------------------------------------------
nominal_images   = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in learn_nominal_paths]
defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in learn_defective_paths]

calib_nominal_images   = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in cal_nominal_paths]
calib_defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in cal_defective_paths]
calib_images = calib_nominal_images + calib_defective_images
calib_labels = ["Nominal"] * len(calib_nominal_images) + ["Defective"] * len(calib_defective_images)

test_nominal_images   = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in test_nominal_paths]
test_defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in test_defective_paths]
test_images = test_nominal_images + test_defective_images

# ------------------------------------------------
# (6) Bookkeeping for downstream metrics
# ------------------------------------------------
labels = ["Nominal", "Defective"]
label_counts = [len(test_nominal_images), len(test_defective_images)]

print(
    f"Train few-shot: Nominal={len(nominal_images)}, Defective={len(defective_images)} | "
    f"Calib: Nominal={len(calib_nominal_images)}, Defective={len(calib_defective_images)} | "
    f"Test: Nominal={len(test_nominal_images)}, Defective={len(test_defective_images)}"
)
Train few-shot: Nominal=50, Defective=50 | Calib: Nominal=50, Defective=50 | Test: Nominal=100, Defective=100

4.2 Run conformal few-shot

Despite the sklearn manual's guidance, isotonic calibration performed much better for this example. The isotonic-regression results are shown below.

results = few_shot_fault_classification_conformal(
    model=model,
    test_images=test_images,
    test_image_filenames=test_image_filenames,
    nominal_images=nominal_images,
    nominal_descriptions=nominal_descriptions,
    defective_images=defective_images,
    defective_descriptions=defective_descriptions,
    calib_images=calib_images,
    calib_labels=calib_labels,
    alpha=0.1,
    temperature=1.0,
    mondrian=True,
    class_labels=("Nominal", "Defective"),
    csv_path="results",
    csv_filename="exp03_results_conformal.csv",
    print_one_liner=False,
    prob_calibration="isotonic",  # or alternatively "sigmoid"
    allow_empty=False
)

4.3 Classification Metrics

labels = ["Nominal", "Defective"]
label_counts = [100, 100]

cls_metrics = compute_classification_metrics(
    "results/exp03_results_conformal.csv",
    labels=labels,
    label_counts=label_counts,
    save_confusion_matrix=True,
    cm_file_path="results",
    cm_file_name="exp03_conf_matrix.png",
    cm_title="Confusion Matrix for Experiment 03 (Isotonic)"
)

cls_metrics.style.hide()
Metric Value
Accuracy 0.990000
Sensitivity (Recall) 0.980000
Specificity 1.000000
Precision 1.000000
F1 Score 0.989899
AUC 0.989800

4.4 Conformal Set Metrics

conf_metrics = compute_conformal_set_metrics(
    "results/exp03_results_conformal.csv",
    labels=labels,
    label_counts=label_counts
)

conf_metrics.style.hide()
Metric Value
Alpha 0.100000
Mondrian True
Coverage (overall) 0.990000
Coverage [Nominal] 1.000000
Coverage [Defective] 0.980000
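As a sanity check on numbers like these, overall and class-conditional coverage can be computed directly from prediction sets and true labels. A minimal sketch with illustrative names (compute_conformal_set_metrics works from the experiment CSV instead):

```python
def coverage(pred_sets, y_true, cls=None):
    """Fraction of (set, label) pairs whose true label lies inside the set,
    optionally restricted to one class (class-conditional coverage)."""
    pairs = [(s, t) for s, t in zip(pred_sets, y_true) if cls is None or t == cls]
    return sum(t in s for s, t in pairs) / len(pairs)

# Toy prediction sets and ground truth
sets = [{"Nominal"}, {"Nominal", "Defective"}, {"Defective"}, {"Nominal"}]
truth = ["Nominal", "Defective", "Defective", "Defective"]
overall = coverage(sets, truth)                 # 3 of 4 sets cover the truth
defective = coverage(sets, truth, "Defective")  # 2 of 3 Defective cases covered
```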