ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Kirill Vishniakov*, Zhiqiang Shen, Zhuang Liu

*Corresponding author for this work

Research output: Contribution to journal › Conference article published in journal › peer-review


Abstract

Modern computer vision offers a great variety of models to practitioners, and selecting the right one for a specific application can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Code is available at github.com/kirill-vish/Beyond-INet.
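Since output calibration is one of the properties the abstract highlights, the sketch below shows how calibration is commonly quantified via expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the gaps between per-bin accuracy and per-bin confidence are averaged, weighted by bin size. This is an illustrative implementation of the standard definition, not code from the linked repository; the bin count and the random logits in the usage example are assumptions.

```python
import torch

def expected_calibration_error(logits, labels, n_bins=15):
    """ECE: size-weighted average gap between confidence and accuracy
    over equal-width confidence bins."""
    probs = torch.softmax(logits, dim=1)
    confidences, predictions = probs.max(dim=1)
    correct = predictions.eq(labels).float()

    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |mean accuracy - mean confidence|, weighted by bin mass
            gap = (correct[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += gap * in_bin.float().mean()
    return ece.item()

# Illustrative usage with random logits standing in for model outputs;
# in practice one would compare, e.g., a supervised and a CLIP-pretrained
# model's logits on the same held-out validation set.
if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(1024, 1000)          # hypothetical model outputs
    labels = torch.randint(0, 1000, (1024,))  # hypothetical ground truth
    print(f"ECE: {expected_calibration_error(logits, labels):.4f}")
```

A lower ECE means the model's confidence scores track its actual accuracy more closely, which is the sense in which two models with identical top-1 accuracy can still differ in calibration.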

Original language: English
Pages (from-to): 49545-49557
Number of pages: 13
Journal: Proceedings of Machine Learning Research
Volume: 235
Publication status: Published - 2024
Externally published: Yes
Event: 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 2024 - 27 Jul 2024

Bibliographical note

Publisher Copyright:
Copyright 2024 by the author(s)
