Image Denoising

SIDD - PSNR (sRGB)

Multi-task Language Understanding

MMLU - Average (%)

Monocular Depth Estimation

NYU-Depth V2 - RMSE

Monocular Depth Estimation

NYU-Depth V2 - absolute relative error

Monocular Depth Estimation

NYU-Depth V2 - Delta < 1.25

Monocular Depth Estimation

NYU-Depth V2 - Delta < 1.25^2

Monocular Depth Estimation

NYU-Depth V2 - Delta < 1.25^3

Monocular Depth Estimation

NYU-Depth V2 - log 10

Data-to-Text Generation

E2E NLG Challenge - METEOR

Crowd Counting

ShanghaiTech A - MAE

Crowd Counting

ShanghaiTech A - MSE

Scene Text Recognition

ICDAR2013 - Accuracy

Video Quality Assessment

LIVE-FB LSVQ - PLCC

Zero-Shot Video Question Answer

NExT-QA - Accuracy

Weakly Supervised Action Localization

ActivityNet-1.2 - Mean mAP

Unsupervised Semantic Segmentation with Language-image Pre-training

Cityscapes val - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

COCO-Stuff-171 - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

PascalVOC-20 - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

PASCAL VOC - mIoU

Image Dehazing

SOTS Outdoor - PSNR

Image-to-Image Translation

ADE20K Labels-to-Photos - mIoU

Image-to-Image Translation

ADE20K Labels-to-Photos - FID

Image-to-Image Translation

Cityscapes Labels-to-Photo - mIoU

Image-to-Image Translation

Cityscapes Labels-to-Photo - FID

Image-to-Image Translation

COCO-Stuff Labels-to-Photos - FID

Video Panoptic Segmentation

VIPSeg - VPQ

Video Panoptic Segmentation

VIPSeg - STQ

Unsupervised Semantic Segmentation with Language-image Pre-training

COCO-Object - mIoU

Multiview Detection

MultiviewX - MODA

Multiview Detection

MultiviewX - Recall

Multiview Detection

Wildtrack - MODA

Multiview Detection

Wildtrack - Recall

Chart Question Answering

ChartQA - 1:1 Accuracy

Robust Object Detection

DWD - mPC [AP50]

3D Instance Segmentation

ScanNet200 - mAP

Few-Shot 3D Point Cloud Classification

ModelNet40 10-way (10-shot) - Overall Accuracy

Referring Video Object Segmentation

Refer-YouTube-VOS - J&F

Referring Video Object Segmentation

Refer-YouTube-VOS - J

Referring Video Object Segmentation

Refer-YouTube-VOS - F

Multiple Object Tracking

KITTI Tracking test - MOTA

Multiple Object Tracking

KITTI Tracking test - HOTA

Multi-Object Tracking

TAO - TETA

Multi-Object Tracking

TAO - LocA

Multi-Object Tracking

TAO - AssocA

Referring Expression Segmentation

Refer-YouTube-VOS (2021 public validation) - J&F

Referring Expression Segmentation

Refer-YouTube-VOS (2021 public validation) - J

Referring Expression Segmentation

Refer-YouTube-VOS (2021 public validation) - F

Column Type Annotation

VizNet-Sato-Full - Macro-F1

Zero-Shot Video Question Answer

MSRVTT-QA - Accuracy

Multi-Person Pose Estimation

CrowdPose - mAP @0.5:0.95

Multi-Person Pose Estimation

CrowdPose - AP Easy

Multi-Person Pose Estimation

CrowdPose - AP Medium

Multi-Person Pose Estimation

CrowdPose - AP Hard

Semi-Supervised Object Detection

COCO 1% labeled data - mAP

Semi-Supervised Object Detection

COCO 2% labeled data - mAP

Semi-Supervised Object Detection

COCO 100% labeled data - mAP

Semi-Supervised Object Detection

COCO 10% labeled data - mAP

Science Question Answering

ScienceQA - Social Science

Science Question Answering

ScienceQA - Language Science

Science Question Answering

ScienceQA - Image Context

Science Question Answering

ScienceQA - No Context

Science Question Answering

ScienceQA - Grades 1-6

Science Question Answering

ScienceQA - Grades 7-12

Science Question Answering

ScienceQA - Avg. Accuracy

Time Series Forecasting

Electricity (720) - MSE

Time Series Forecasting

Electricity (336) - MSE

Facial Expression Recognition (FER)

AffectNet - Accuracy (7 emotion)

Domain Generalization

GTA-to-Avg(Cityscapes,BDD,Mapillary) - mIoU

Text-based Image Editing

PIE-Bench - Background PSNR

Text-based Image Editing

PIE-Bench - Background LPIPS

Weakly-Supervised Semantic Segmentation

PASCAL VOC 2012 train - Mean IoU

Image Generation

ImageNet 512x512 - FID

Image Manipulation Detection

DSO-1 - Balanced Accuracy

Image Manipulation Detection

Casia V1+ - Balanced Accuracy

Image Manipulation Detection

COVERAGE - AUC

Image Manipulation Detection

COVERAGE - Balanced Accuracy

Image Manipulation Detection

CocoGlide - Balanced Accuracy

3D Question Answering (3D-QA)

ScanQA Test w/ objects - BLEU-1

3D Question Answering (3D-QA)

ScanQA Test w/ objects - BLEU-4

3D Question Answering (3D-QA)

ScanQA Test w/ objects - ROUGE

3D Question Answering (3D-QA)

ScanQA Test w/ objects - METEOR

3D Question Answering (3D-QA)

ScanQA Test w/ objects - CIDEr

Referring Expression Segmentation

RefCOCO+ test B - Overall IoU

Referring Expression Segmentation

RefCOCO+ testA - Overall IoU

Referring Expression Segmentation

RefCOCO+ val - Overall IoU

Referring Expression Segmentation

RefCOCOg-val - Overall IoU

Referring Expression Segmentation

RefCOCOg-test - Overall IoU

Natural Language Moment Retrieval

TACoS - R@1,IoU=0.3

Natural Language Moment Retrieval

TACoS - R@1,IoU=0.5

Natural Language Moment Retrieval

TACoS - R@1,IoU=0.7

Natural Language Moment Retrieval

TACoS - mIoU

Egocentric Pose Estimation

GlobalEgoMocap Test Dataset - Average MPJPE (mm)

Egocentric Pose Estimation

GlobalEgoMocap Test Dataset - PA-MPJPE

Egocentric Pose Estimation

SceneEgo - Average MPJPE (mm)

Egocentric Pose Estimation

SceneEgo - PA-MPJPE

Zero-Shot Video Question Answer

TVQA - Accuracy

Zero-Shot Video Question Answer

ActivityNet-QA - Accuracy

Zero-Shot Video Question Answer

STAR: Situated Reasoning - Accuracy

Video Retrieval

VATEX - text-to-video R@50

Panoptic Scene Graph Generation

PSG Dataset - mR@20

Semi-Supervised Semantic Segmentation

PASCAL VOC 2012 732 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

PASCAL VOC 2012 366 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

PASCAL VOC 2012 183 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

ADE20K 1/32 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

PASCAL VOC 2012 1464 labels - Validation mIoU

Semi-Supervised Semantic Segmentation

PASCAL VOC 2012 92 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

COCO 1/32 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

ADE20K 1/16 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

Cityscapes 6.25% labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

COCO 1/128 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

COCO 1/512 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

COCO 1/256 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

COCO 1/64 labeled - Validation mIoU

Semi-Supervised Semantic Segmentation

Cityscapes 100 samples labeled - Validation mIoU

Anomaly Detection In Surveillance Videos

UCF-Crime - ROC AUC

3D Instance Segmentation

ScanNet(v2) - mAP @ 50

3D Instance Segmentation

ScanNet(v2) - mAP@25

Synthetic-to-Real Translation

SYNTHIA-to-Cityscapes - MIoU (13 classes)

Synthetic-to-Real Translation

SYNTHIA-to-Cityscapes - MIoU (16 classes)

GZSL Video Classification

ActivityNet-GZSL(main) - HM

GZSL Video Classification

ActivityNet-GZSL(main) - ZSL

GZSL Video Classification

VGGSound-GZSL(main) - HM

GZSL Video Classification

VGGSound-GZSL(main) - ZSL

GZSL Video Classification

VGGSound-GZSL (cls) - HM

GZSL Video Classification

VGGSound-GZSL (cls) - ZSL

GZSL Video Classification

UCF-GZSL(main) - HM

GZSL Video Classification

UCF-GZSL(main) - ZSL

GZSL Video Classification

UCF-GZSL (cls) - HM

GZSL Video Classification

UCF-GZSL (cls) - ZSL

GZSL Video Classification

ActivityNet-GZSL (cls) - HM

GZSL Video Classification

ActivityNet-GZSL (cls) - ZSL

Visual Instruction Following

LLaVA-Bench - avg score

Low-Dose X-Ray CT Reconstruction

X3D - PSNR

Low-Dose X-Ray CT Reconstruction

X3D - SSIM

Multi-task Language Understanding

BBH-nlp - Average (%)

Zero-Shot Video Question Answer

MSRVTT-QA - Confidence Score

Zero-Shot Video Question Answer

MSVD-QA - Confidence Score

Zero-Shot Video Question Answer

TGIF-QA - Accuracy

Zero-Shot Video Question Answer

TGIF-QA - Confidence Score

Multiple Object Tracking

BDD100K test - mMOTA

Zero-Shot Video Question Answer

ActivityNet-QA - Confidence Score

Video-based Generative Performance Benchmarking (Consistency)

VideoInstruct - gpt-score

Zero-Shot Composed Image Retrieval (ZS-CIR)

Fashion IQ - (Recall@10+Recall@50)/2

Sports Ball Detection and Tracking

Volleyball - F1 (%)

Sports Ball Detection and Tracking

Volleyball - Accuracy (%)

Sports Ball Detection and Tracking

Volleyball - Average Precision (%)

Sports Ball Detection and Tracking

Soccer - F1 (%)

Sports Ball Detection and Tracking

Soccer - Average Precision (%)

Sports Ball Detection and Tracking

Soccer - Accuracy (%)

Sports Ball Detection and Tracking

Tennis - F1 (%)

Sports Ball Detection and Tracking

Tennis - Accuracy (%)

Sports Ball Detection and Tracking

Tennis - Average Precision (%)

Sports Ball Detection and Tracking

Basketball - F1 (%)

Sports Ball Detection and Tracking

Basketball - Accuracy (%)

Sports Ball Detection and Tracking

Basketball - Average Precision (%)

Sports Ball Detection and Tracking

Badminton - F1 (%)

Sports Ball Detection and Tracking

Badminton - Accuracy (%)

Sports Ball Detection and Tracking

Badminton - Average Precision (%)

Multimodal Emotion Recognition

IEMOCAP - F1

Multimodal Emotion Recognition

IEMOCAP - Weighted Accuracy (WA)

Action Classification

MiT - Top 1 Accuracy

Visual Question Answering

MM-Vet - GPT-4 score

Image Dehazing

SOTS Outdoor - SSIM

Video Anomaly Detection

HR-ShanghaiTech - AUC

Video Anomaly Detection

HR-Avenue - AUC

Camouflaged Object Segmentation

COD - MAE

Camouflaged Object Segmentation

COD - Weighted F-Measure

Camouflaged Object Segmentation

COD - S-Measure

Video Object Detection

ImageNet VID - mAP

Video Retrieval

Condensed Movies - text-to-video R@1

Video Retrieval

Condensed Movies - text-to-video R@5

Video Retrieval

Condensed Movies - text-to-video R@10

Video Retrieval

QuerYD - text-to-video R@1

Video Retrieval

QuerYD - text-to-video R@10

Video Retrieval

QuerYD - text-to-video R@5

Multi-Label Image Classification

BigEarthNet (official test set) - F1 Score

Aspect-Based Sentiment Analysis (ABSA)

SemEval 2014 Task 4 Subtask 1+2 - F1

Low-Light Image Enhancement

LOL - Average PSNR

Low-Light Image Enhancement

LOLv2 - LPIPS

Hand Gesture Recognition

LSA16 - Accuracy

Retinal Vessel Segmentation

DRIVE - mIoU

Retinal Vessel Segmentation

CHASE_DB1 - mIoU

Visual Question Answering

VQA v2 test-dev - Accuracy

No-Reference Image Quality Assessment

CSIQ - PLCC

Analog Video Restoration

TAPE - LPIPS

Analog Video Restoration

TAPE - PSNR

Semi-Supervised Video Object Segmentation

YouTube-VOS 2019 - Overall

Semi-Supervised Video Object Segmentation

YouTube-VOS 2019 - F-Measure (Unseen)

Semi-Supervised Video Object Segmentation

MOSE - J&F

Semi-Supervised Video Object Segmentation

MOSE - J

Semi-Supervised Video Object Segmentation

MOSE - F

Vehicle Re-Identification

VeRi-Wild Small - mAP

Conditional Image Generation

ImageNet 128x128 - FID

Domain Generalization

TerraIncognita - Average Accuracy

No-Reference Image Quality Assessment

TID2013 - SRCC

No-Reference Image Quality Assessment

TID2013 - PLCC

No-Reference Image Quality Assessment

KADID-10k - SRCC

No-Reference Image Quality Assessment

KADID-10k - PLCC

No-Reference Image Quality Assessment

CSIQ - SRCC

Visual Question Answering

BenchLMM - GPT-3.5 score

Knowledge Base Question Answering

WebQuestionsSP - Hits@1

Knowledge Base Question Answering

ComplexWebQuestions - Accuracy

Zero-Shot Composed Image Retrieval (ZS-CIR)

CIRCO - mAP@10

Unsupervised Semantic Segmentation

COCO-Stuff-81 - mIoU

Unsupervised Semantic Segmentation

COCO-Stuff-81 - Pixel Accuracy

3D Lane Detection

OpenLane-V2 val - DET_l

3D Lane Detection

OpenLane-V2 val - TOP_lt

3D Object Detection

V2XSet - AP0.7 (Perfect)

Multimodal Sentiment Analysis

CMU-MOSI - Acc-7

Zero-Shot Object Detection

PASCAL VOC'07 - mAP

Zero-Shot Object Detection

MS-COCO - mAP

3D Object Detection

DAIR-V2X-I - AP|R40(moderate)

3D Object Detection

DAIR-V2X-I - AP|R40(easy)

3D Object Detection

DAIR-V2X-I - AP|R40(hard)

3D Object Detection

Rope3D - AP@0.7

Semi-Supervised Image Classification

CIFAR-100, 400 Labels - Percentage error

Low-Light Image Enhancement

LOL - LPIPS

Single Image Deraining

Rain100H - SSIM

Single Image Deraining

Rain100H - PSNR

Vehicle Re-Identification

VeRi-776 - mAP

Vehicle Re-Identification

VeRi-776 - Rank-1

Vehicle Re-Identification

VehicleID Small - mAP

Type prediction

ManyTypes4TypeScript - Average Accuracy

Automated Theorem Proving

miniF2F-valid - Pass@100

Time Series Forecasting

ETTh2 (192) Multivariate - MSE

Time Series Forecasting

ETTh2 (192) Multivariate - MAE

Time Series Forecasting

ETTh2 (96) Univariate - MSE

Time Series Forecasting

ETTh2 (336) Multivariate - MSE

Time Series Forecasting

ETTh2 (336) Multivariate - MAE

Time Series Forecasting

ETTh1 (192) Multivariate - MSE

Time Series Forecasting

ETTh1 (192) Multivariate - MAE

Time Series Forecasting

ETTh2 (192) Univariate - MSE

Time Series Forecasting

ETTh2 (192) Univariate - MAE

Time Series Forecasting

ETTh1 (192) Univariate - MSE

Time Series Forecasting

ETTh1 (192) Univariate - MAE

Time Series Forecasting

ETTh1 (336) Multivariate - MSE

Time Series Forecasting

ETTh1 (336) Multivariate - MAE

Time Series Forecasting

ETTh2 (96) Multivariate - MSE

Time Series Forecasting

ETTh2 (96) Multivariate - MAE

3D Multi-Person Mesh Recovery

AGORA - FB-NMVE

3D Multi-Person Mesh Recovery

AGORA - B-NMVE

3D Multi-Person Mesh Recovery

AGORA - FB-MVE

3D Multi-Person Mesh Recovery

AGORA - F-MVE

Video Retrieval

MSVD - video-to-text R@10

Open Vocabulary Object Detection

LVIS v1.0 - AP novel-LVIS base training

Open Vocabulary Object Detection

LVIS v1.0 - AP novel-Unrestricted open-vocabulary training

Image-text matching

CommercialAdsDataset - ADD(S) AUC

Photo geolocation estimation

GWS15k - City level (25 km)

Photo geolocation estimation

GWS15k - Region level (200 km)

Photo geolocation estimation

GWS15k - Country level (750 km)

Photo geolocation estimation

GWS15k - Continent level (2500 km)

Photo geolocation estimation

YFCC26k - Street level (1 km)

Photo geolocation estimation

YFCC26k - Region level (200 km)

Photo geolocation estimation

YFCC26k - Country level (750 km)

Photo geolocation estimation

YFCC26k - Continent level (2500 km)

Photo geolocation estimation

Im2GPS3k - Street level (1 km)

Photo geolocation estimation

Im2GPS3k - City level (25 km)

Photo geolocation estimation

Im2GPS3k - Region level (200 km)

Photo geolocation estimation

Im2GPS3k - Country level (750 km)