Calibrated CLIP Probabilities using Conformal Prediction
Authors
Affiliations
Fadel M. Megahed
Farmer School of Business, Miami University
Ying-Ju (Tessa) Chen
Department of Mathematics, University of Dayton
Bianca Maria Colosimo
Dipartimento di Meccanica, Politecnico di Milano
Marco Luigi Giuseppe Grasso
Dipartimento di Meccanica, Politecnico di Milano
L. Allison Jones-Farmer
Farmer School of Business, Miami University
Published
November 3, 2025
1 Overview
This file shows how we:
run few-shot classification with CLIP,
add conformal prediction for set-valued outputs, and
compute both standard and conformal metrics.
It is organized to be easy to follow and to mirror our earlier ad‑hoc functions in fmegahed/qe_genai, now refactored into modules.
1.1 Modules
I refactored our previous code into modules to make it easier to read and to keep the focus on the outputs of this file. The modules are described below:
image_io.py:
load_image — robust local/URL image loader with timeout handling.
sample_urls — helper for random sampling and returning both chosen and remaining file lists.
wrappers.py:
encode_and_normalize — encodes images with CLIP and L2-normalizes embeddings.
CLIPWrapper — lightweight sklearn-compatible wrapper exposing predict_proba and predict for few-shot inference.
zero_shot.py:
evaluate_zero_shot_predictions — evaluates zero-shot CLIP results, reports metrics, and optionally saves confusion matrices.
conformal.py:
few_shot_fault_classification_conformal — main routine for few-shot + conformal prediction, including
Global and Mondrian (class-conditional) calibration,
Optional probability calibration (isotonic or sigmoid), and
An allow_empty flag to control abstention behavior.
metrics.py:
compute_classification_metrics — computes standard classification metrics from experiment CSVs.
compute_conformal_set_metrics — computes coverage and efficiency metrics for conformal sets.
make_true_labels_from_counts — reconstructs ground-truth label vectors from class counts.
viz.py:
Optional visualization helpers for confusion matrices.
1.2 Changes from Prior Work
Compared to our earlier implementation in fmegahed/qe_genai, we have three major updates:
wrappers.py — CLIP is now wrapped as an sklearn-style classifier (CLIPWrapper), exposing predict_proba and predict.
This abstraction makes the model compatible with downstream calibration and conformal logic.
few_shot_fault_classification_conformal —
The prior few_shot_fault_classification has been replaced with a version that:
Implements Global and Mondrian conformal prediction directly.
Allows optional probability calibration before conformal scoring; in this demo, I use isotonic regression.
Provides an allow_empty=False flag to suppress abstentions and always yield at least one label.
Metrics expansion —
compute_conformal_set_metrics adds conformal-specific diagnostics such as coverage and class-conditional coverage.
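To make update (1) concrete, here is a minimal, hypothetical sketch of what an sklearn-style wrapper over L2-normalized embeddings looks like. It is not the actual CLIPWrapper from wrappers.py: the class name, nearest-centroid logic, and toy 2-D "embeddings" below are illustrative assumptions; only the predict_proba/predict interface and temperature-scaled softmax mirror the text.

```python
import numpy as np

class TinyCLIPStyleWrapper:
    """Hypothetical sklearn-style wrapper: nearest-centroid over
    L2-normalized embeddings, with a temperature-scaled softmax."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class, re-normalized to unit length
        cents = [np.mean(X[y == c], axis=0) for c in self.classes_]
        self.centroids_ = np.stack([c / np.linalg.norm(c) for c in cents])
        return self

    def predict_proba(self, X):
        logits = X @ self.centroids_.T / self.temperature  # scaled cosine similarities
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Toy usage with unit-norm 2-D "embeddings"
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array(["Nominal", "Nominal", "Defective", "Defective"])
clf = TinyCLIPStyleWrapper(temperature=0.5).fit(X, y)
print(clf.predict(X))  # → ['Nominal' 'Nominal' 'Defective' 'Defective']
```

Because the wrapper exposes predict_proba with rows summing to one, any downstream calibration or conformal routine that expects an sklearn-like classifier can consume it unchanged.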
1.3 Probability Calibration
Before conformal prediction, we optionally calibrate the model probabilities to improve reliability.
Given uncalibrated positive-class probabilities \(p_{\text{uncal}}\), we learn a monotone mapping \(g\) on a calibration split and apply: \[p_{\text{cal}} = g(p_{\text{uncal}}), \qquad \hat{p}(y \mid x) = \begin{bmatrix} 1 - p_{\text{cal}} \\ p_{\text{cal}} \end{bmatrix}.\]
1.3.1 Calibration Options
1.3.1.1 Isotonic Regression
Uses sklearn.isotonic.IsotonicRegression(out_of_bounds="clip").
Learns a piecewise-constant, monotone calibration curve \(g\) from pairs \((p_{\text{uncal}}, y)\).
Per the sklearn manual, isotonic calibration performs as well as or better than sigmoid calibration when there is enough data (more than roughly 1,000 samples) to avoid overfitting.
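As a small illustration of the isotonic option (with made-up calibration data, not our experiment's probabilities), the monotone map \(g\) can be fit and applied like this:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic calibration split: uncalibrated positive-class probabilities
# and binary outcomes (illustrative values only)
p_uncal = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y       = np.array([0,   0,   0,   1,   0,   1,   1,   1])

# Learn the monotone map g on the calibration split;
# out_of_bounds="clip" handles test probabilities outside the fitted range
g = IsotonicRegression(out_of_bounds="clip").fit(p_uncal, y)

# Apply to new uncalibrated probabilities and rebuild [1 - p, p]
p_new = np.array([0.05, 0.5, 0.95])
p_cal = g.predict(p_new)
proba = np.column_stack([1.0 - p_cal, p_cal])
print(proba)
```

The calibrated values are non-decreasing in the input by construction, which is exactly the monotonicity property the conformal scores rely on.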
1.3.1.2 Sigmoid (Platt-Style) Calibration
Uses sklearn.linear_model.LogisticRegression on logit-transformed probabilities: \[\operatorname{logit}(p) = \log\!\left(\frac{p}{1 - p}\right).\]
The calibrated probability is: \[p_{\text{cal}} = \sigma(a \cdot \operatorname{logit}(p_{\text{uncal}}) + b),\] where \(\sigma\) is the logistic function and \(a, b\) are learned from calibration data.
Per the sklearn manual, this approach is recommended for small calibration sets, but not necessarily for imbalanced data.
Both calibrations are implemented internally via a simple adapter that applies transform_fn to the positive class and reconstructs [1 - p, p].
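A minimal sketch of the sigmoid option, fitting \(a\) and \(b\) with LogisticRegression on logit-transformed probabilities (synthetic data; the package's internal adapter may differ in its details):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0) at the boundaries
    return np.log(p / (1 - p))

# Synthetic calibration split (illustrative values only)
p_uncal = np.array([0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8])
y       = np.array([0,   0,   0,   1,    0,    1,   1,   1])

# p_cal = sigmoid(a * logit(p_uncal) + b): a is the learned coefficient,
# b the learned intercept
lr = LogisticRegression().fit(logit(p_uncal).reshape(-1, 1), y)

p_new = np.array([0.1, 0.5, 0.9])
p_cal = lr.predict_proba(logit(p_new).reshape(-1, 1))[:, 1]
proba = np.column_stack([1.0 - p_cal, p_cal])  # reconstruct [1 - p, p]
print(p_cal)
```

Because the map is a two-parameter sigmoid, it needs far less calibration data than isotonic regression, at the cost of assuming a particular (sigmoidal) miscalibration shape.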
2 Conformal Prediction Background
Conformal prediction converts point predictions into prediction sets with guaranteed coverage probability \(1 - \alpha\). For classification, the target is: \[\Pr\big(Y_{n+1} \in \hat{C}_{n+1}\big) \ge 1 - \alpha.\] The guarantee is finite-sample valid, making conformal prediction suitable for few-shot or safety-critical tasks.
2.1 Nonconformity Scores
Each calibration sample \((x_i, y_i)\) yields a score: \[s_i = 1 - p_{\text{cal}}(y_i \mid x_i),\] where smaller \(s_i\) implies higher model confidence in the correct label.
2.2 Finite-Sample Quantile
The conformal threshold is the conservative quantile: \[q_{1-\alpha} = \text{Quantile}_{\lceil (1-\alpha)(n+1) \rceil / n}(s).\]
2.3 Prediction Set Construction
For a new input \(x^*\), include class \(y\) in the prediction set if: \[p_{\text{cal}}(y \mid x^*) \ge 1 - q_{1-\alpha}.\] Thus the conformal set is: \[\hat{C}(x^*) = \{\, y : p_{\text{cal}}(y \mid x^*) \ge 1 - q_{1-\alpha} \,\}.\]
2.4 Mondrian (Class-Conditional) Conformal Prediction
To ensure per-class coverage, Mondrian conformal prediction computes thresholds separately for each class: \[q_y = \text{Quantile}_{1-\alpha}\big(\{\, 1 - p_{\text{cal}}(y_i \mid x_i) : y_i = y \,\}\big).\] The resulting guarantee is: \[\Pr\big(Y_{n+1} \in \hat{C}(x_{n+1}) \mid Y_{n+1} = y\big) \ge 1 - \alpha.\]
2.5 Implementation Mapping
| Concept | Implementation Detail |
|---------|-----------------------|
| Score definition and quantile | _finite_sample_quantile in conformal.py implements the finite-sample rule above. |
| Class-specific thresholds | _fit_mondrian_thresholds computes per-class \(q_y\) from calibration scores \(s = 1 - p_{\text{true}}\). |
| Set construction | _predict_sets_mondrian and _predict_sets_global include a label \(y\) if \(p_y(x^*) \ge 1 - q_y\). When allow_empty=False, the argmax class is inserted when the set would otherwise be empty. |
| Probabilities \(p_y(x)\) | CLIPWrapper.predict_proba converts cosine-similarity logits to probabilities using a softmax function with a temperature parameter. |
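The finite-sample quantile, set construction, and allow_empty=False fallback described in this section can be sketched in a few lines of numpy. This is a toy illustration with synthetic calibrated probabilities, not the package's _finite_sample_quantile or _predict_sets_* code:

```python
import numpy as np

def finite_sample_quantile(scores, alpha):
    """Conservative conformal quantile: Quantile_{ceil((1-alpha)(n+1))/n}(s)."""
    n = len(scores)
    level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

rng = np.random.default_rng(0)

# Toy calibrated probabilities of the TRUE class on a calibration split
p_true = rng.uniform(0.6, 1.0, size=50)
scores = 1.0 - p_true                      # nonconformity scores s_i
q = finite_sample_quantile(scores, alpha=0.1)

# Prediction set for a new input: include y if p_cal(y | x*) >= 1 - q
p_star = np.array([0.85, 0.15])            # e.g. [Nominal, Defective]
labels = np.array(["Nominal", "Defective"])
pred_set = labels[p_star >= 1.0 - q]
if pred_set.size == 0:                     # allow_empty=False behavior:
    pred_set = labels[[np.argmax(p_star)]] # fall back to the argmax class
print(q, pred_set)
```

The Mondrian variant simply repeats the quantile computation once per class, using only the calibration scores whose true label equals that class.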
3 Imports
In the code chunk below, I import our functions from the internal package. Additionally, I use the torchinfo package to show the details of our loaded CLIP model. Note that this allowed me to observe the following:
Comment: in our previous paper, we did not load the 336px version of the ViT-L/14 model. If we wanted to do that, we should have explicitly loaded the model as: ViT-L/14@336px.
```python
import sys, os

sys.path.insert(0, os.path.abspath("."))

## Import our functions/package
from conformal_clip import (
    get_image_urls,
    load_image,
    evaluate_zero_shot_predictions,
    CLIPWrapper,
    encode_and_normalize,
    few_shot_fault_classification_conformal,
    compute_classification_metrics,
    compute_conformal_set_metrics,
    make_true_labels_from_counts
)
print("Package imported.")

## Load CLIP model
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from torchinfo import summary

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

vision_sum = summary(
    model.visual,
    verbose=0,
    input_size=(1, 3, 224, 224),  # batch, channels, H, W
)

print("-----------------------------------------------")
print("Comment in our previous paper we did not load the 336px version of the\n"
      "ViT-L/14 model. If we wanted to do that, we should have explicitly loaded\n"
      "the model as: ViT-L/14@336px.")
print(vision_sum)
```
4 The "Stochastic Textured Surfaces" Experiment
4.1 Loading the Images
```python
# --- Load and split local images -------------------------------
import os, glob
import numpy as np
from conformal_clip.image_io import load_image, sample_urls

# Base folder
base_dir = os.path.join("data", "textile_images", "simulated")
nominal_dir = os.path.join(base_dir, "nominal")
local_dir = os.path.join(base_dir, "local")
global_dir = os.path.join(base_dir, "global")

# Gather file paths for common image extensions (no new helper function)
exts = ["jpg", "jpeg", "png"]
nominal_paths = sum([glob.glob(os.path.join(nominal_dir, f"*.{e}")) for e in exts], [])
local_paths = sum([glob.glob(os.path.join(local_dir, f"*.{e}")) for e in exts], [])
global_paths = sum([glob.glob(os.path.join(global_dir, f"*.{e}")) for e in exts], [])

# Reproducible RNG
rng = np.random.default_rng(2024)

# ------------------------------------------------
# (1) Test set: 100 Nominal and 100 Defective
#     (Defective = 50 Global + 50 Local)
# ------------------------------------------------
test_size = 100
test_nominal_paths, remaining_nominal_paths = sample_urls(nominal_paths, test_size, rng)
test_global_paths, remaining_global_paths = sample_urls(global_paths, test_size // 2, rng)
test_local_paths, remaining_local_paths = sample_urls(local_paths, test_size // 2, rng)
test_defective_paths = test_global_paths + test_local_paths
test_paths = test_nominal_paths + test_defective_paths
test_image_filenames = [os.path.basename(p) for p in test_paths]

# ------------------------------------------------
# (2) Training (few-shot): 50 per class total
#     Defective = 25 global + 25 local
# ------------------------------------------------
train_nominal_n = 50
train_def_global_n = 25
train_def_local_n = 25

learn_nominal_paths, remaining_nominal_paths = sample_urls(remaining_nominal_paths, train_nominal_n, rng)
learn_global_paths, remaining_global_paths = sample_urls(remaining_global_paths, train_def_global_n, rng)
learn_local_paths, remaining_local_paths = sample_urls(remaining_local_paths, train_def_local_n, rng)
learn_defective_paths = learn_global_paths + learn_local_paths

# ------------------------------------------------
# (3) Calibration: same sizes as training
# ------------------------------------------------
cal_nominal_paths, remaining_nominal_paths = sample_urls(remaining_nominal_paths, train_nominal_n, rng)
cal_global_paths, remaining_global_paths = sample_urls(remaining_global_paths, train_def_global_n, rng)
cal_local_paths, remaining_local_paths = sample_urls(remaining_local_paths, train_def_local_n, rng)
cal_defective_paths = cal_global_paths + cal_local_paths

# ------------------------------------------------
# (4) Descriptions for few-shot references
# ------------------------------------------------
learn_nominal_fns = [os.path.basename(p) for p in learn_nominal_paths]
learn_global_fns = [os.path.basename(p) for p in learn_global_paths]
learn_local_fns = [os.path.basename(p) for p in learn_local_paths]

nominal_descriptions = [
    f"Image {fn}: nominal textile, consistent weave, no visible defects."
    for fn in learn_nominal_fns
]
global_descriptions = [
    f"Image {fn}: global distortion, uniform shift across texture."
    for fn in learn_global_fns
]
local_descriptions = [
    f"Image {fn}: localized defect disrupting weave pattern."
    for fn in learn_local_fns
]
defective_descriptions = global_descriptions + local_descriptions

# ------------------------------------------------
# (5) Load and preprocess images
# ------------------------------------------------
nominal_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in learn_nominal_paths]
defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in learn_defective_paths]

calib_nominal_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in cal_nominal_paths]
calib_defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in cal_defective_paths]
calib_images = calib_nominal_images + calib_defective_images
calib_labels = ["Nominal"] * len(calib_nominal_images) + ["Defective"] * len(calib_defective_images)

test_nominal_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in test_nominal_paths]
test_defective_images = [preprocess(load_image(p)).unsqueeze(0).to(device) for p in test_defective_paths]
test_images = test_nominal_images + test_defective_images

# ------------------------------------------------
# (6) Bookkeeping for downstream metrics
# ------------------------------------------------
labels = ["Nominal", "Defective"]
label_counts = [len(test_nominal_images), len(test_defective_images)]

print(f"Train few-shot: Nominal={len(nominal_images)}, Defective={len(defective_images)} | "
      f"Calib: Nominal={len(calib_nominal_images)}, Defective={len(calib_defective_images)} | "
      f"Test: Nominal={len(test_nominal_images)}, Defective={len(test_defective_images)}")
```
4.2 Run Conformal Few-Shot
Despite what is mentioned in the sklearn manual, the isotonic calibration was much better for this example. See the results for the isotonic regression below.
```python
results = few_shot_fault_classification_conformal(
    model=model,
    test_images=test_images,
    test_image_filenames=test_image_filenames,
    nominal_images=nominal_images,
    nominal_descriptions=nominal_descriptions,
    defective_images=defective_images,
    defective_descriptions=defective_descriptions,
    calib_images=calib_images,
    calib_labels=calib_labels,
    alpha=0.1,
    temperature=1.0,
    mondrian=True,
    class_labels=("Nominal", "Defective"),
    csv_path="results",
    csv_filename="exp03_results_conformal.csv",
    print_one_liner=False,
    prob_calibration="isotonic",  # or alternatively "sigmoid"
    allow_empty=False
)
```
4.3 Classification Metrics
```python
labels = ["Nominal", "Defective"]
label_counts = [100, 100]

cls_metrics = compute_classification_metrics(
    "results/exp03_results_conformal.csv",
    labels=labels,
    label_counts=label_counts,
    save_confusion_matrix=True,
    cm_file_path="results",
    cm_file_name="exp03_conf_matrix.png",
    cm_title="Confusion Matrix for Experiment 03 (Isotonic)"
)
cls_metrics.style.hide()
```
4.4 Conformal Set Metrics
```python
conf_metrics = compute_conformal_set_metrics(
    "results/exp03_results_conformal.csv",
    labels=labels,
    label_counts=label_counts
)
conf_metrics.style.hide()
```