Intro

This project is about getting better zero-shot classification of images and text from CV/LLM models without spending time and money on fine-tuning during training or re-running models at inference. It uses a novel dimensionality reduction technique on embeddings and determines classes with tournament-style pairwise comparison. The result was an increase in text/image agreement from 61% to 89% on a ~50k-item dataset spanning 13 classes.

https://github.com/doc1000/pairwise_classification

Where you will use it

The practical application is large-scale class search where inference speed matters and model spend is a concern. It is also useful for finding errors in your annotation process, i.e., misclassifications in a large database.

Results

The weighted F1 score measuring text/image class agreement went from 61% to 89% for ~50k items across 13 classes. A visual inspection also validated the results.

F1 score (weighted)   Base model   Pairwise
Multiclass            0.613        0.889
Binary                0.661        0.645
Focusing on the multi-class work, class-count cohesion improves with the pairwise model.
Left: base model (argmax of cosine similarity over the full embedding)
Right: pairwise tourney model using feature sub-segments scored by cross ratio
Image by author

Method: Pairwise comparison of cosine similarity of embedding sub-dimensions determined by mean-scale scoring

A straightforward approach to vector classification is to compare image/text embeddings to class embeddings using cosine similarity. It's relatively quick and requires minimal overhead. You can also run a classification model on the embeddings (logistic regression, trees, SVM) and target the class without further embeddings.
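For reference, a minimal sketch of that cosine-similarity baseline, assuming pre-computed CLIP item embeddings and one embedded prompt per class (the function names are illustrative, not the repository's API):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between the rows of a [n, d] and the rows of b [k, d]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                                   # [n, k]

def baseline_zero_shot(item_emb, class_emb):
    """Assign every item to the class whose embedding it is most similar to."""
    sims = cosine_sim(item_emb, class_emb)           # [n_items, n_classes]
    return sims.argmax(axis=1), sims
```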

My approach was to shrink the feature set of the embeddings by determining which feature distributions were substantially different between two classes, and thus contributed information with less noise. For scoring features, I used a derivation of variance that encompasses two distributions, which I refer to as cross variance (more below). I used this to get important dimensions for the 'clothing' category (one vs. the rest) and re-classified using the sub-features, which showed some improvement in model power. However, the sub-feature comparison showed better results when comparing classes pairwise (one vs. one, head to head). Separately for images and text, I constructed an array-wide 'tournament' style bracket of pairwise comparisons, run until a final class was determined for each item. It ends up being fairly efficient. I then scored the agreement between the text and image classifications.

Using cross variance, pair-specific feature selection, and pairwise tourney assignment.

All images by author unless stated otherwise in captions

I am using a readily available product image database with pre-calculated CLIP embeddings (thank you SQID and AMZN, both cited below; SQID is released under the MIT License and the Amazon dataset under the Apache License 2.0), and I am targeting the clothing images because that is where I first saw this effect (thank you, DS team at Nordstrom). The dataset was narrowed from 150k items/images/descriptions to ~50k clothing items using zero-shot classification, and then the augmented classification based on targeted subarrays.

Test Statistic: Cross Variance

This is a method to determine how different the distribution is for two different classes when targeting a single feature/dimension. It is a measure of the combined average variance if each element of both distributions is dropped into the other distribution. It is an expansion of the math of variance/standard deviation, but between two distributions (that can be of varying size). I have not seen it used before, although it may be listed under a different moniker. 

Cross Variance:
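Written out, one way to express this for samples a_1..a_n from class A and b_1..b_m from class B (the 1/(2nm) normalization is an assumption on my part, chosen so that the same-distribution case reduces exactly to variance):

\[
\varsigma_{AB}^{2} \;=\; \frac{1}{2nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(a_i - b_j\right)^2
\]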

It is similar to variance, except that it sums over both distributions and takes the difference between each pair of values rather than each value's difference from the mean of a single distribution. If you input the same distribution as both A and B, it yields the same result as variance.

This simplifies to:
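With \(\bar{a}\) and \(\bar{b}\) the per-class means, \(\overline{a^2}\) and \(\overline{b^2}\) the per-class means of squares, and keeping the normalization assumed above:

\[
\varsigma_{AB}^{2} \;=\; \frac{\overline{a^2} + \overline{b^2}}{2} \;-\; \bar{a}\,\bar{b}
\]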

This is equivalent to the alternate definition of variance (the mean of the squares minus the square of the mean) for a single distribution when the distributions i and j are equal. Using this version is massively faster and more memory efficient than attempting to broadcast the arrays against each other directly. I will provide the proof and go into more detail in another write-up. Cross deviation (ς) is the square root of cross variance.

To score features, I use a ratio. The numerator is the cross variance. The denominator is the product of the two standard deviations, the same as the denominator of the Pearson correlation. Then I take the root (I could just as easily use cross variance, which would compare more directly with covariance, but I've found the ratio to be more compact and interpretable using cross deviation).

I interpret this as the increased combined standard deviation if you swapped classes for each item. A large number means the feature distribution is likely quite different for the two classes.
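As a sketch of the computation, using the closed form and normalization assumed above (the function names are mine, not the repository's):

```python
import numpy as np

def cross_variance(a, b):
    """Cross variance of two 1-D samples of one feature, via the closed form:
    half the sum of the per-sample means of squares, minus the product of the
    per-sample means. Equals the ordinary variance when a and b are identical."""
    return 0.5 * (np.mean(a**2) + np.mean(b**2)) - np.mean(a) * np.mean(b)

def cross_ratio(a, b, eps=1e-12):
    """Root of cross variance over the product of the two standard deviations
    (the Pearson-style denominator). Roughly 1 when the two class distributions
    look alike; larger when they differ."""
    return np.sqrt(cross_variance(a, b) / (np.std(a) * np.std(b) + eps))

def score_features(emb_a, emb_b):
    """Cross ratio of every embedding dimension for one class pair.
    emb_a: [n_a, d] embeddings of class A; emb_b: [n_b, d] embeddings of class B."""
    return np.array([cross_ratio(emb_a[:, k], emb_b[:, k])
                     for k in range(emb_a.shape[1])])
```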

For an embedding feature with a low cross ratio, the difference in distributions is minimal: very little information is lost if you transfer an item from one class to the other. However, for a feature with a high cross ratio relative to these two classes, there is a large difference in the distribution of feature values, in this case in both mean and variance. The high cross ratio feature provides much more information.
Image by author

This is an alternative to mean-scale difference tests such as the Kolmogorov-Smirnov test; Bayesian two-distribution tests and the Fréchet Inception Distance are other alternatives. I like the elegance and novelty of cross variance. I will likely follow up by looking at other differentiators. I should note that determining distributional differences for a normalized feature with overall mean 0 and sd of 1 is its own challenge.

Sub-dimensions: dimensionality reduction of embedding space for classification

When you are trying to find a particular characteristic of an image, do you need the whole embedding? Is color or whether something is a shirt or pair of pants located in a narrow section of the embedding? If I’m looking for a shirt, I don’t necessarily care if it’s blue or red, so I just look at the dimensions that define ‘shirtness’ and throw out the dimensions that define color.

The red highlighted dimensions demonstrate importance when determining if an image contains clothing. We focus on those dimensions when attempting to classify.
Image by author

I am taking an [n, 768]-dimensional embedding and narrowing it down to closer to 100 dimensions that actually matter for a particular class pair. Why? Because the cosine similarity metric (cosim) is influenced by the noise of the relatively unimportant features. The embedding carries a tremendous amount of information, much of which you simply don't care about in a classification problem. Get rid of the noise and the signal gets stronger: cosim increases with the elimination of 'unimportant' dimensions.

In the above, you can see that the average cosine similarity rises as the minimum feature cross ratio increases (corresponding to fewer features on the right), until it collapses because there are too few features. I used a cross ratio of 1.2 to balance increased fit with reduced information.
Image by author
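A sketch of the selection step, reusing score_features and cosine_sim from the earlier snippets (the 1.2 threshold is the one discussed above; the function names are illustrative):

```python
import numpy as np

def select_dims(emb_a, emb_b, min_ratio=1.2):
    """Indices of the embedding dimensions whose cross ratio for this class
    pair exceeds the threshold."""
    return np.where(score_features(emb_a, emb_b) > min_ratio)[0]

def subdim_similarity(item_emb, class_emb, dims):
    """Cosine similarity computed only over the selected sub-dimensions."""
    return cosine_sim(item_emb[:, dims], class_emb[:, dims])
```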

For a pairwise comparison, I first split items into the two classes using standard cosine similarity applied to the full embedding. I exclude items that show very low cosim, on the assumption that the model's skill is low for those items (cosim limit). I also exclude items that show low differentiation between the two classes (cosim diff). The result is two distributions from which to extract the important dimensions that should define the 'true' difference between the classifications:

The light blue dots represent images that seem more likely to contain clothing. The dark blue dots are non-clothing. The peach line going down the middle is an area of uncertainty, and is excluded from the next steps. Similarly, the dark dots are excluded because the model does not have a lot of confidence in classifying them at all. Our objective is to isolate the two classes, extract the features that differentiate them, then determine if there is agreement between the image and text models.
Image by author
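A minimal sketch of that filtering step (the threshold values here are placeholders, not the article's):

```python
import numpy as np

def confident_split(sims_pair, cosim_limit=0.2, cosim_diff=0.02):
    """sims_pair: [n_items, 2] full-embedding cosine similarities of every item
    to the two classes of a pair. Drops items whose best similarity is below the
    cosim limit (low model skill) or whose two similarities are nearly equal
    (low differentiation), then splits the remainder into the two classes."""
    best = sims_pair.max(axis=1)
    gap = np.abs(sims_pair[:, 0] - sims_pair[:, 1])
    keep = (best >= cosim_limit) & (gap >= cosim_diff)
    in_a = keep & (sims_pair[:, 0] >= sims_pair[:, 1])
    in_b = keep & (sims_pair[:, 1] > sims_pair[:, 0])
    return in_a, in_b
```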

Array Pairwise Tourney Classification

Getting a global class assignment out of pairwise comparisons requires some thought. You could take the given assignment and compare just that class to all the others. If there was good skill in the initial assignment, this should work well, but if multiple alternative classes are superior, you run into trouble. A Cartesian approach comparing all vs. all would get you there, but it would get big quickly. I settled on an array-wide 'tournament' style bracket of pairwise comparisons.

This has log₂(#classes) rounds, and the total number of comparisons maxes out at Σ_round (pairings in round × n_items) across some specified number of features. I randomize the ordering of the 'teams' each round so the comparisons aren't the same each time. It has some matchup risk but gets to a winner quickly. It's built to handle an array of comparisons at each round, rather than iterating over items.
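A rough sketch of the bracket idea (not the repository's implementation): dims_lookup is assumed to map each sorted class pair to its selected sub-dimensions, and cosine_sim is the helper from the first snippet.

```python
import numpy as np

def pair_similarity(item_emb, class_emb, a, b, dims_lookup):
    """Cosine similarity of items to classes a and b over that pair's
    selected sub-dimensions. Returns an [n_items, 2] array."""
    dims = dims_lookup[(min(a, b), max(a, b))]
    return cosine_sim(item_emb[:, dims], class_emb[[a, b]][:, dims])

def pairwise_tourney(item_emb, class_emb, dims_lookup, seed=0):
    """Single-elimination bracket over classes, vectorized over items. Each
    round shuffles the surviving bracket slots, pairs them off, and advances,
    per item, whichever class of the pair wins the sub-dimension cosine
    comparison. An odd slot count gets a bye to the next round."""
    rng = np.random.default_rng(seed)
    n_items, n_classes = item_emb.shape[0], class_emb.shape[0]
    # survivors[i, s] = class currently held by item i in bracket slot s
    survivors = np.tile(np.arange(n_classes), (n_items, 1))

    while survivors.shape[1] > 1:
        survivors = survivors[:, rng.permutation(survivors.shape[1])]
        bye = survivors[:, -1:] if survivors.shape[1] % 2 else survivors[:, :0]
        paired = survivors[:, :survivors.shape[1] - survivors.shape[1] % 2]
        winners = []
        for s in range(0, paired.shape[1], 2):
            left, right = paired[:, s], paired[:, s + 1]
            win = left.copy()
            # one vectorized comparison per distinct class pair in this slot
            for a, b in {(int(min(l, r)), int(max(l, r))) for l, r in zip(left, right)}:
                mask = ((left == a) & (right == b)) | ((left == b) & (right == a))
                sims = pair_similarity(item_emb[mask], class_emb, a, b, dims_lookup)
                win[mask] = np.where(sims[:, 0] >= sims[:, 1], a, b)
            winners.append(win)
        survivors = np.column_stack(winners + [bye])
    return survivors[:, 0]
```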

Scoring

Finally, I scored the process by determining whether the classifications from the text and from the image match. As long as the distribution isn't heavily overweighted towards a 'default' class (it is not), this should be a good assessment of whether the process is pulling real information out of the embeddings.

I looked at the weighted F1 score comparing the classes assigned using the image vs. the text description, on the assumption that the better the agreement, the more likely the classification is correct. For my dataset of ~50k images and text descriptions of clothing across 13 classes, the score went from 42% for the simple full-embedding cosine similarity model, to 55% for the sub-feature cosim, to 89% for the pairwise model with sub-features. A visual inspection also validated the results. The binary classification wasn't the primary goal; it was largely to get a sub-segment of the data on which to test multi-class boosting.
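The agreement score itself is a one-liner with scikit-learn; a small sketch with placeholder variable names:

```python
from sklearn.metrics import f1_score

def text_image_agreement(text_classes, image_classes):
    """Weighted F1 between the classes assigned from the text embeddings and
    the classes assigned from the image embeddings of the same items."""
    return f1_score(text_classes, image_classes, average="weighted")
```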

F1 score (weighted)   Base model   Pairwise
Multiclass            0.613        0.889
Binary                0.661        0.645
The combined confusion matrix shows a tighter match between image and text. Note that the top end of the scale is higher in the right chart and there are fewer blocks with split assignments.
Image by author
Similarly, the combined confusion matrix shows a tighter match between image and text. For a given text class (bottom), there is greater agreement with the image class in the pairwise model. This view also highlights the size of the classes via the width of the columns.
Image by author using code from Nils Flaschel

Final Thoughts…

This may be a good method for finding errors in large subsets of annotated data, or for doing zero-shot labeling without extensive extra GPU time for fine-tuning and training. It introduces some novel scoring and approaches, but the overall process is not overly complicated or CPU/GPU/memory intensive.

Follow-up work will apply this to other image/text datasets, as well as to annotated/categorized image or text datasets, to determine whether scoring is boosted. In addition, it would be interesting to determine whether the boost in zero-shot classification for this dataset changes substantially if:

  1. Other scoring metrics are used instead of the cross deviation ratio
  2. Full-feature embeddings are substituted for targeted features
  3. The pairwise tourney is replaced by another approach

I hope you find it useful.

Citations

Reddy, C. K., Màrquez, L., Valero, F., Rao, N., Zaragoza, H., Bandyopadhyay, S., Biswas, A., Xing, A., & Subbian, K. (2022). Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. arXiv:2206.06588.

Al Ghossein, M., Chen, C. W., & Tang, J. Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search.
