Intro
This project is about getting better zero-shot classification of images and text from CV/LLM models without spending time and money on fine-tuning, and without re-running the models at inference. It uses a novel dimensionality-reduction technique on embeddings and determines classes using tournament-style pairwise comparison. On a ~50k-item dataset with 13 classes, it increased text/image agreement from 61% to 89%.
https://github.com/doc1000/pairwise_classification
Where you will use it
The practical application is large-scale class search where inference speed is important and model spend is a concern. It is also useful for finding errors in your annotation process, i.e. misclassifications in a large database.
Results
The weighted F1 score comparing text and image class agreement went from 61% to 89% for ~50k items across 13 classes. A visual inspection also validated the results.
| F1 score (weighted) | Base model | Pairwise |
| --- | --- | --- |
| Multiclass | 0.613 | 0.889 |
| Binary | 0.661 | 0.645 |
Left: base model, full embedding, argmax on cosine similarity
Right: pairwise tourney model using feature sub-segments scored by cross-deviation ratio
Image by author
Method: Pairwise comparison of cosine similarity of embedding sub-dimensions determined by mean-scale scoring
A straightforward approach to vector classification is to compare image/text embeddings to class embeddings using cosine similarity. It's relatively quick and requires minimal overhead. You can also run a classification model on the embeddings (logistic regression, trees, SVM) and predict the class without any further embeddings.
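As a rough sketch of that baseline (assuming unit-normalized CLIP item embeddings and one text embedding per class; the function name is illustrative, not the repo's API):

```python
import numpy as np

def cosine_baseline(item_embs: np.ndarray, class_embs: np.ndarray) -> np.ndarray:
    """Assign each item to the class with the highest cosine similarity.

    item_embs:  [n_items, d] embedding matrix
    class_embs: [n_classes, d] one embedding per class label
    """
    # Normalize so the dot product equals cosine similarity
    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    class_embs = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = item_embs @ class_embs.T          # [n_items, n_classes]
    return sims.argmax(axis=1)               # predicted class index per item
```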
My approach was to reduce the feature size of the embeddings by determining which feature distributions were substantially different between two classes, and thus contributed information with less noise. For scoring features, I used a derivation of variance that spans two distributions, which I refer to as cross-variance (more below). I used this to find the important dimensions for the 'clothing' category (one-vs-rest) and re-classified using the sub-features, which showed some improvement in model power. However, the sub-feature comparison showed better results when comparing classes pairwise (one-vs-one, head-to-head). Separately for images and text, I constructed an array-wide 'tournament' style bracket of pairwise comparisons until a final class was determined for each item. It ends up being fairly efficient. I then scored the agreement between the text and image classifications.
Using cross-variance, pair-specific feature selection and pairwise tourney assignment.

I am using a product image database that was readily available with pre-calculated CLIP embeddings (thank you SQID, cited below and released under the MIT License, and AMZN, cited below and licensed under Apache License 2.0), and targeting the clothing images because that is where I first saw this effect (thank you DS team at Nordstrom). The dataset was narrowed from 150k items/images/descriptions down to ~50k clothing items using zero-shot classification, followed by the augmented classification based on targeted sub-arrays.

Test Statistic: Cross Variance
This is a method to determine how different the distributions of two classes are when targeting a single feature/dimension. It is a measure of the combined average variance if each element of each distribution were dropped into the other distribution. It is an expansion of the math of variance/standard deviation to two distributions (which can be of differing sizes). I have not seen it used before, although it may be listed under a different moniker.
Cross Variance:

Similar to variance, except you sum over both distributions and take the difference between each pair of values, instead of each value's difference from the mean of a single distribution. If you input the same distribution as both A and B, it yields the same result as variance.
This simplifies to:

This is equivalent to the alternate definition of variance (the mean of the squares minus the square of the mean) for a single distribution when the distributions i and j are equal. Using this version is massively faster and more memory-efficient than attempting to broadcast the arrays directly. I will provide the proof and go into more detail in another write-up. Cross deviation (ς) is the square root of cross variance.
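Since the formula images are not reproduced here, a LaTeX rendering consistent with that description is below; the 1/(2·n_A·n_B) normalization is my assumption, chosen so the expression reduces to ordinary variance when A = B:

$$
\sigma^2_{AB} \;=\; \frac{1}{2\,n_A n_B}\sum_{i=1}^{n_A}\sum_{j=1}^{n_B}\left(a_i - b_j\right)^2
\;=\; \tfrac{1}{2}\left(\overline{a^2} + \overline{b^2}\right) - \bar{a}\,\bar{b}
$$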
To score features, I use a ratio. The numerator is the cross variance. The denominator is the product of the two class standard deviations, the same as the denominator of the Pearson correlation. Then I take the root (I could just as easily use cross variance, which would compare more directly with covariance, but I've found the ratio to be more compact and interpretable using cross deviation).

I interpret this as the increased combined standard deviation if you swapped classes for each item. A large number means the feature distribution is likely quite different for the two classes.

Image by author
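A minimal NumPy sketch of the per-feature score as described above, using the simplified closed form rather than broadcasting all pairs (function names are mine, not necessarily the repo's):

```python
import numpy as np

def cross_var(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-feature cross variance of two samples a: [n_a, d] and b: [n_b, d].

    Uses the simplified form 0.5*(mean(a^2) + mean(b^2)) - mean(a)*mean(b),
    which equals the ordinary (population) variance when a and b are the same sample.
    """
    return 0.5 * ((a ** 2).mean(axis=0) + (b ** 2).mean(axis=0)) - a.mean(axis=0) * b.mean(axis=0)

def cross_dev_ratio(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Root of cross variance over the product of the two per-class standard deviations."""
    return np.sqrt(cross_var(a, b) / (a.std(axis=0) * b.std(axis=0)))
```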
This is an alternative mean/scale difference test; the KS test, Bayesian two-distribution tests and Fréchet Inception Distance are alternatives. I like the elegance and novelty of cross variance, and I will likely follow up by looking at other differentiators. I should note that determining distributional differences for a normalized feature with an overall mean of 0 and sd of 1 is its own challenge.
Sub-dimensions: dimensionality reduction of embedding space for classification
When you are trying to find a particular characteristic of an image, do you need the whole embedding? Is color, or whether something is a shirt or a pair of pants, located in a narrow section of the embedding? If I'm looking for a shirt, I don't necessarily care if it's blue or red, so I just look at the dimensions that define 'shirtness' and throw out the dimensions that define color.

Image by author
I am taking an [n, 768]-dimensional embedding and narrowing it down to roughly 100 dimensions that actually matter for a particular class pair. Why? Because the cosine similarity metric (cosim) gets influenced by the noise of the relatively unimportant features. The embedding carries a tremendous amount of information, much of which you simply don't care about in a classification problem. Get rid of the noise and the signal gets stronger: cosim increases with the elimination of 'unimportant' dimensions.

Image by author
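A sketch of what this looks like in code; the dimension count and helper names are purely illustrative:

```python
import numpy as np

def top_dims(scores: np.ndarray, k: int = 100) -> np.ndarray:
    """Indices of the k highest-scoring dimensions (e.g. by cross-deviation ratio)."""
    return np.argsort(scores)[-k:]

def sub_cosim(item_embs: np.ndarray, class_emb: np.ndarray, dims: np.ndarray) -> np.ndarray:
    """Cosine similarity computed only on the selected sub-dimensions."""
    a = item_embs[:, dims]                    # [n_items, k]
    b = class_emb[dims]                       # [k]
    return (a @ b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))
```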
For a pairwise comparison, first split items into classes using standard cosine similarity applied to the full embedding. I exclude some items that show very low cosim, on the assumption that model skill is low for those items (the cosim limit). I also exclude items that show low differentiation between the two classes (the cosim diff). The result is two distributions from which to extract the important dimensions that should define the 'true' difference between the classifications:

Image by author
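A sketch of that filtering step; the threshold values and function name here are illustrative placeholders, not the repo's tuned settings:

```python
import numpy as np

def pair_samples(sims: np.ndarray, labels: np.ndarray, embs: np.ndarray,
                 c1: int, c2: int, cosim_limit: float = 0.2, cosim_diff: float = 0.01):
    """Build the two per-class samples used to score dimensions for a class pair.

    sims:   [n_items, n_classes] full-embedding cosine similarities
    labels: [n_items] argmax class assignments from the full embedding
    embs:   [n_items, d] item embeddings
    """
    confident = sims.max(axis=1) > cosim_limit                   # drop items the base model is unsure about
    separated = np.abs(sims[:, c1] - sims[:, c2]) > cosim_diff   # drop items that barely separate the pair
    keep = confident & separated
    a = embs[keep & (labels == c1)]
    b = embs[keep & (labels == c2)]
    return a, b   # feed these to the cross-deviation scoring above
```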
Array Pairwise Tourney Classification
Getting a global class assignment out of pairwise comparisons requires some thought. You can take the given assignment and compare just that class to all the others. If there was good skill in the initial assignment, this should work well, but if multiple alternate classes are superior, you run into trouble. A cartesian approach where you compare all vs all would get you there, but would get big quickly. I settled on an array-wide ‘tournament’ style bracket of pairwise comparisons.

This has log₂(#classes) rounds, and the total number of comparisons maxes out at the sum over rounds of (the number of class pairings in that round × n_items), across some specified number of features. I randomize the ordering of 'teams' each round so the comparisons aren't the same each time. It has some match-up risk but gets to a winner quickly. It's built to handle an array of comparisons at each round, rather than iterating over items.
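A simplified, per-item version of the bracket; the actual implementation works on whole arrays of items per round, and pairwise_winner here is a stand-in for the sub-feature cosine comparison:

```python
import random

def tourney_winner(class_ids: list, pairwise_winner) -> int:
    """Single-elimination bracket over class ids for one item.

    pairwise_winner(c1, c2) -> the winning class id of a head-to-head comparison,
    e.g. argmax of cosine similarity restricted to the (c1, c2) sub-dimensions.
    """
    remaining = list(class_ids)
    while len(remaining) > 1:
        random.shuffle(remaining)                 # randomize match-ups each round
        nxt = []
        if len(remaining) % 2 == 1:               # odd count: one class gets a bye
            nxt.append(remaining.pop())
        for c1, c2 in zip(remaining[0::2], remaining[1::2]):
            nxt.append(pairwise_winner(c1, c2))
        remaining = nxt
    return remaining[0]
```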
Scoring
Finally, I scored the process by determining whether the classifications from text and images match. As long as the distribution isn't heavily overweight towards a 'default' class (it is not), this should be a good assessment of whether the process is pulling real information out of the embeddings.
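The agreement score itself is a one-liner with scikit-learn, assuming you already have per-item class assignments from each modality:

```python
from sklearn.metrics import f1_score

def class_agreement(text_classes, image_classes) -> float:
    """Weighted F1 between text-derived and image-derived class assignments.

    The roles are symmetric for agreement purposes: either side can serve as the reference.
    """
    return f1_score(text_classes, image_classes, average="weighted")
```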
I looked at the weighted F1 score comparing the classes assigned using the image vs. the text description, the assumption being that the better the agreement, the more likely the classification is correct. For my dataset of ~50k images and text descriptions of clothing with 13 classes, the score went from 42% for the simple full-embedding cosine similarity model, to 55% for the sub-feature cosim, to 89% for the pairwise model with sub-features. A visual inspection also validated the results. The binary classification wasn't the primary goal; it was largely to get a sub-segment of the data on which to then test multi-class boosting.
| F1 score (weighted) | Base model | Pairwise |
| --- | --- | --- |
| Multiclass | 0.613 | 0.889 |
| Binary | 0.661 | 0.645 |

Image by author

Image by author using code from Nils Flaschel
Final Thoughts…
This may be a good method for finding errors in large sets of annotated data, or for doing zero-shot labeling without extensive extra GPU time for fine-tuning and training. It introduces some novel scoring and approaches, but the overall process is not overly complicated or CPU/GPU/memory-intensive.
Follow-up will be applying it to other image/text datasets, as well as to annotated/categorized image or text datasets, to determine whether scoring is boosted. In addition, it would be interesting to determine whether the boost in zero-shot classification for this dataset changes substantially if:
- Other scoring metrics are used instead of cross deviation ratio
- Full feature embeddings are substituted for targeted features
- Pairwise tourney is replaced by another approach
I hope you find it useful.
Citations
Reddy, C. K., Màrquez, L., Valero, F., Rao, N., Zaragoza, H., Bandyopadhyay, S., Biswas, A., Xing, A., & Subbian, K. (2022). Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. arXiv:2206.06588.
Al Ghossein, M., Chen, C. W., & Tang, J. Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search.