Video Anomaly Detection

HR-ShanghaiTech - AUC

Video Anomaly Detection

ShanghaiTech Campus - AUC

Speech Emotion Recognition

MSP-Podcast (Valence) - CCC

Speech Emotion Recognition

MSP-Podcast (Dominance) - CCC

Speech Emotion Recognition

MSP-Podcast (Activation) - CCC

Dynamic Link Prediction

DBLP Temporal - AUC

Dynamic Link Prediction

DBLP Temporal - AP

Point Cloud Registration

RotKITTI Registration Benchmark - RR@(1.5,0.3)

Point Cloud Registration

RotKITTI Registration Benchmark - RR@(1,0.1)

Image Segmentation

MAS3K - S-measure

Image Segmentation

MAS3K - mIoU

Image Segmentation

MAS3K - E-measure

Image Segmentation

MAS3K - MAE

Image Segmentation

RMAS - S-measure

Image Segmentation

MSD (Mirror Segmentation Dataset) - MAE

Image Segmentation

MSD (Mirror Segmentation Dataset) - IoU

Image Segmentation

MSD (Mirror Segmentation Dataset) - F-measure

Image Segmentation

PMD - MAE

Image Segmentation

PMD - IoU

Image Segmentation

PMD - F-measure

Speech Synthesis

LibriTTS - Periodicity

3D Object Detection

ScanNetV2 - [email protected]

Unsupervised Semantic Segmentation with Language-image Pre-training

Cityscapes val - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

COCO-Stuff-171 - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

PASCAL Context-59 - mIoU

Unsupervised Semantic Segmentation with Language-image Pre-training

COCO-Object - mIoU

Zero-Shot Video Question Answer

Video-MME (w/o subs) - Accuracy (%)

Thermal Image Segmentation

MFN Dataset - mIoU

Video Object Detection

ImageNet VID - mAP

Molecular Property Prediction

BBBP - ROC-AUC

Molecular Property Prediction

FreeSolv - RMSE

3D Object Detection

nuScenes - mAAE

Monocular Depth Estimation

NYU-Depth V2 - absolute relative error

Motion Synthesis

AIOZ-GDANCE - FID

Motion Synthesis

AIOZ-GDANCE - MMC

Motion Synthesis

AIOZ-GDANCE - GMC

3D Object Detection

nuScenes LiDAR only - NDS

3D Object Detection

nuScenes LiDAR only - mAP

3D Object Detection

nuScenes LiDAR only - NDS (val)

3D Object Detection

nuScenes LiDAR only - mAP (val)

Point Tracking

TAP-Vid-DAVIS - Average Jaccard

Point Tracking

TAP-Vid-DAVIS - Occlusion Accuracy

Temporal Action Localization

HACS - Average-mAP

Temporal Action Localization

HACS - [email protected]

Temporal Action Localization

HACS - [email protected]

Temporal Action Localization

HACS - [email protected]

Generalized Zero Shot skeletal action recognition

NTU RGB+D 120 - Harmonic Mean (10 unseen classes)

Low-Light Image Enhancement

LOL-v2 - Average PSNR

Low-Light Image Enhancement

LOL-v2 - SSIM

Low-Light Image Enhancement

LOL-v2 - LPIPS

Low-Light Image Enhancement

LOL-v2-synthetic - PSNR

Low-Light Image Enhancement

LOL-v2-synthetic - SSIM

Object Detection

PKU-DDD17-Car - mAP50

3D Semantic Scene Completion from a single RGB image

NYUv2 - mIoU

Overlapped 100-10

ADE20K - Mean IoU (test)

Video Quality Assessment

LIVE-VQC - PLCC

Unsupervised Video Object Segmentation

FBMS test - J

Open Vocabulary Object Detection

LVIS v1.0 - AP novel-LVIS base training

Facial Action Unit Detection

DISFA - Average F1

Facial Expression Recognition (FER)

RAF-DB - Overall Accuracy

Object Detection

CrowdHuman (full body) - mMR

Object Detection

InOutDoor - AP

Object Detection

EventPed - AP

Object Detection

STCrowd - AP

Domain Generalization

GTA-to-Avg(Cityscapes,BDD,Mapillary) - mIoU

3D Hand Pose Estimation

FreiHAND - PA-MPVPE

3D Hand Pose Estimation

FreiHAND - PA-F@5mm

3D Hand Pose Estimation

FreiHAND - PA-F@15mm

3D Hand Pose Estimation

HInt: Hand Interactions in the wild - [email protected] (New Days) All

3D Hand Pose Estimation

HInt: Hand Interactions in the wild - [email protected] (VISOR) All

3D Hand Pose Estimation

HInt: Hand Interactions in the wild - [email protected] (NewDays) Visible

3D Hand Pose Estimation

HInt: Hand Interactions in the wild - [email protected] (VISOR) Visible

3D Hand Pose Estimation

HInt: Hand Interactions in the wild - [email protected] (NewDays) Occ

3D Hand Pose Estimation

HO-3D v3 - PA-MPJPE

3D Hand Pose Estimation

HO-3D v3 - PA-MPVPE

3D Hand Pose Estimation

HO-3D v3 - F@5mm

3D Hand Pose Estimation

HO-3D v3 - F@15mm

3D Hand Pose Estimation

HO-3D v3 - AUC_J

3D Hand Pose Estimation

HO-3D v3 - AUC_V

Cross-modal retrieval with noisy correspondence

CC152K - Image-to-text R@1

Cross-modal retrieval with noisy correspondence

CC152K - Text-to-image R@5

Cross-modal retrieval with noisy correspondence

COCO-Noisy - Image-to-text R@10

Robot Manipulation Generalization

The COLOSSEUM - Average decrease average across all perturbations

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - Dice

Video Polyp Segmentation

SUN-SEG-Easy - Dice

Video Polyp Segmentation

SUN-SEG-Hard - Dice

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - Dice

Video Panoptic Segmentation

VIPSeg - VPQ

Object Detection In Aerial Images

HRSC2016 - mAP-07

Object Detection In Aerial Images

HRSC2016 - mAP-12

Video Frame Interpolation

Xiph-2K - PSNR

Video Frame Interpolation

Xiph-4K - SSIM

Video Frame Interpolation

SNU-FILM (easy) - SSIM

Video Frame Interpolation

X4K1000FPS-2K - PSNR

Video Frame Interpolation

X4K1000FPS-2K - SSIM

Few-Shot Semantic Segmentation

COCO-20i (2-way 1-shot) - mIoU

Self-supervised Scene Flow Estimation

Argoverse 2 - EPE 3-Way

Self-supervised Scene Flow Estimation

Argoverse 2 - EPE Foreground Static

Self-supervised Scene Flow Estimation

Argoverse 2 - EPE Background Static

Zero-Shot Video Question Answer

NExT-QA - Accuracy

Zero-Shot Video Question Answer

ActivityNet-QA - Accuracy

Zero-Shot Video Question Answer

TGIF-QA - Accuracy

Zero-Shot Video Question Answer

TGIF-QA - Confidence Score

Zero-Shot Video Question Answer

MSRVTT-QA - Confidence Score

Zero-Shot Video Question Answer

EgoSchema (subset) - Accuracy

Single Image Desnowing

CSD - Average PSNR (dB)

Referring Expression Segmentation

RefCOCO+ val - Overall IoU

Referring Expression Segmentation

RefCOCO+ testA - Overall IoU

Referring Expression Segmentation

RefCOCO testB - Overall IoU

Referring Expression Segmentation

RefCOCO testA - Overall IoU

Referring Expression Segmentation

RefCOCO+ testB - Overall IoU

Image Dehazing

SOTS Indoor - PSNR

Image Dehazing

O-Haze - PSNR

Image Dehazing

I-Haze - PSNR

Saliency Prediction

SALICON - AUC

Saliency Prediction

SALICON - KLD

Saliency Prediction

SALECI - KL

Skeleton Based Action Recognition

First-Person Hand Action Benchmark - 1:1 Accuracy

Hand Gesture Recognition

SHREC 2017 - 14 Gestures Accuracy

Hand Gesture Recognition

SHREC 2017 - 28 Gestures Accuracy

Hand Gesture Recognition

DHG-14 - Accuracy

Hand Gesture Recognition

DHG-28 - Accuracy

Few-Shot Learning

DTD - 8-shot Accuracy

Few-Shot Learning

DTD - 4-shot Accuracy

Few-Shot Learning

DTD - 16-shot Accuracy

Mitigating Contextual Bias

FGVC Aircraft - Top-1 Accuracy (%)

Mitigating Contextual Bias

FGVC Aircraft - OOD Accuracy (%)

Video Super-Resolution

REDS4 - 4x upscaling - PSNR

Video Super-Resolution

REDS4 - 4x upscaling - SSIM

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - Sensitivity

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - S-measure

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - mean E-measure

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - weighted F-measure

Video Polyp Segmentation

SUN-SEG-Hard (Unseen) - mean F-measure

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - Sensitivity

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - S-measure

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - mean E-measure

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - weighted F-measure

Video Polyp Segmentation

SUN-SEG-Easy (Unseen) - mean F-measure

Lipreading

Lip Reading in the Wild - Top-1 Accuracy

Multiple Object Tracking

SportsMOT - HOTA

Multiple Object Tracking

SportsMOT - IDF1

Multiple Object Tracking

SportsMOT - AssA

Visual Question Answering

MMBench - GPT-3.5 score

Zero-Shot Video Question Answer

IntentQA - Accuracy

Video-based Generative Performance Benchmarking (Consistency)

VideoInstruct - gpt-score

Zero-Shot Composed Image Retrieval (ZS-CIR)

Fashion IQ - (Recall@10+Recall@50)/2

3D Lane Detection

Apollo Synthetic 3D Lane - F1

Zero-Shot Video Question Answer

MSRVTT-QA - Accuracy

Zero-Shot Video Question Answer

MSVD-QA - Accuracy

Long-range modeling

LRA - Text

Long-range modeling

LRA - Retrieval

Long-range modeling

LRA - Image

Zero-Shot Video Question Answer

Video-MME - Accuracy (%)

Zero-Shot Video Question Answer

EgoSchema (fullset) - Accuracy

Audio Classification

ICBHI Respiratory Sound Database - ICBHI Score

Object Detection

AI-TOD - AP

Object Detection

AI-TOD - AP50

Object Detection

AI-TOD - AP75

Object Detection

AI-TOD - APvt

Object Detection

AI-TOD - APt

Object Detection

AI-TOD - APs

Generalized Zero-Shot Learning

SUN Attribute - Harmonic mean

Generalized Zero-Shot Learning

AwA2 - Harmonic mean

Unsupervised Domain Adaptation

Market to Duke - mAP

Unsupervised Domain Adaptation

Market to Duke - rank-5

Unsupervised Domain Adaptation

Market to Duke - rank-10

Unsupervised Domain Adaptation

Duke to MSMT - mAP

Unsupervised Domain Adaptation

Duke to MSMT - rank-1

Unsupervised Domain Adaptation

Duke to MSMT - rank-10

Unsupervised Domain Adaptation

Duke to MSMT - rank-5

Unsupervised Domain Adaptation

Market to MSMT - mAP

Unsupervised Domain Adaptation

Market to MSMT - rank-1

Unsupervised Domain Adaptation

Market to MSMT - rank-10

Unsupervised Domain Adaptation

Market to MSMT - rank-5

Unsupervised Domain Adaptation

Duke to Market - mAP

Unsupervised Domain Adaptation

Duke to Market - rank-1

Unsupervised Domain Adaptation

Duke to Market - rank-5

Unsupervised Domain Adaptation

Duke to Market - rank-10

Few Shot Action Recognition

Kinetics-100 - Accuracy

Single-View 3D Reconstruction

GSO - Chamfer Distance

Action Anticipation

EPIC-KITCHENS-100 - Recall@5

Saliency Prediction

SALICON - CC

Saliency Prediction

SALICON - SIM

Source-Free Domain Adaptation

VisDA-2017 - Accuracy

Visual Question Answering

MM-Vet - GPT-4 score

Cross-modal retrieval with noisy correspondence

CC152K - Image-to-text R@10

Cross-modal retrieval with noisy correspondence

CC152K - R-Sum

Generalized Referring Expression Segmentation

gRefCOCO - gIoU

Generalized Referring Expression Segmentation

gRefCOCO - cIoU

Cross-Domain Few-Shot Object Detection

UODD - mAP

3D Semantic Scene Completion from a single RGB image

KITTI-360 - mIoU

3D Semantic Scene Completion from a single RGB image

SemanticKITTI - mIoU

Supervised Video Summarization

SumMe - Kendall's Tau

Supervised Video Summarization

SumMe - Spearman's Rho

Multiple Object Tracking

KITTI Tracking test - MOTA

Multiple Object Tracking

KITTI Tracking test - HOTA

Crowd Counting

JHU-CROWD++ - MAE

Crowd Counting

UCF CC 50 - MAE

Crowd Counting

ShanghaiTech A - MAE

Crowd Counting

ShanghaiTech A - MSE

Few-Shot Object Detection

ODinW-13 - Average Score

Few-Shot Object Detection

ODinW-35 - Average Score

Zero-Shot Object Detection

ODinW - Average Score

Zero-Shot Object Detection

LVIS v1.0 minival - AP

Object Detection

ODinW Full-Shot 13 Tasks - AP

Crowd Counting

ShanghaiTech A - RMSE

Crowd Counting

ShanghaiTech B - MAE

Time Series Forecasting

ETTm2 (192) Multivariate - MSE

Time Series Forecasting

ETTm2 (96) Multivariate - MSE

Time Series Forecasting

ETTm1 (192) Multivariate - MSE

Time Series Forecasting

ETTm1 (720) Multivariate - MSE

Time Series Forecasting

ETTm1 (336) Multivariate - MSE

Time Series Forecasting

ETTm1 (96) Multivariate - MSE

Time Series Forecasting

ETTm2 (336) Multivariate - MSE

Time Series Forecasting

ETTm2 (720) Multivariate - MSE

Visual Instruction Following

LLaVA-Bench - avg score

Semi-supervised Change Detection

LEVIR-CD - 5% labeled data - IoU

Semi-supervised Change Detection

WHU - 10% labeled data - IoU

Semi-supervised Change Detection

WHU - 5% labeled data - IoU

RGB-T Tracking

GTOT - Success

RGB-T Tracking

RGBT210 - Precision

RGB-T Tracking

RGBT210 - Success

Heterogeneous Node Classification

ACM (Heterogeneous Node Classification) - Micro-F1

Heterogeneous Node Classification

Freebase (Heterogeneous Node Classification) - Micro-F1

Generative 3D Object Classification

Objaverse - Objaverse (I)

Generative 3D Object Classification

Objaverse - Objaverse (Average)

Generative 3D Object Classification

Objaverse - Objaverse (C)

RGB-T Tracking

GTOT - Precision

Audio Classification

SSC - Accuracy

Audio Classification

SHD - Percentage correct

Classify murmurs

CirCor DigiScope - Weighted Accuracy

Zero-Shot Video Question Answer

ActivityNet-QA - Confidence Score

Zero-Shot Video Question Answer

MSVD-QA - Confidence Score

Few-Shot 3D Point Cloud Classification

ModelNet40 10-way (10-shot) - Overall Accuracy

Skeleton Based Action Recognition

UAV-Human - CSv1(%)

Skeleton Based Action Recognition

UAV-Human - CSv2(%)

Image Segmentation

RMAS - mIoU

Image Segmentation

RMAS - E-measure

Image Segmentation

RMAS - MAE

Math Word Problem Solving

SVAMP - Accuracy

3D Human Pose Estimation

RICH - MPJPE

3D Human Pose Estimation

RICH - PA-MPJPE

Semi-supervised Change Detection

WHU - 20% labeled data - IoU

Semi-supervised Change Detection

LEVIR-CD - 20% labeled data - IoU

Semi-supervised Change Detection

LEVIR-CD - 10% labeled data - IoU

Semi-supervised Change Detection

WHU - 40% labeled data - IoU

Semi-supervised Change Detection

LEVIR-CD - 40% labeled data - IoU

Graph Classification

MNIST - Accuracy

Graph Classification

Peptides-func - AP

3D Human Pose Estimation

RICH - MPVPE

Motion Synthesis

HumanML3D - Multimodality

Motion Synthesis

InterHuman - FID

Motion Synthesis

InterHuman - R-Precision Top3

Dichotomous Image Segmentation

DIS-TE1 - max F-measure

Dichotomous Image Segmentation

DIS-TE1 - weighted F-measure

Dichotomous Image Segmentation

DIS-TE1 - MAE