TF-Vision Model Garden

⚠️ Disclaimer: Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review the terms and conditions made available by third parties before using the datasets provided. Checkpoints are licensed under Apache 2.0.

⚠️ Disclaimer: Datasets hyperlinked from this page are not owned or distributed by Google. Such datasets are made available by third parties. Please review the terms and conditions made available by the third parties before using the data.

Introduction

The TF-Vision modeling library provides a collection of computer vision baselines and checkpoints for image classification, object detection, and segmentation.

Backbones

  • DilatedResNet
  • EfficientNet
  • MobileDet
  • MobileNet
  • ResNet
  • ResNet3D
  • RevNet
  • SpineNet
  • SpineNetMobile
  • VisionTransformer

Decoders

  • ASPP
  • FPN
  • NASFPN

Heads

  • DetectionHead
  • MaskHead
  • MaskScoring
  • RPNHead
  • RetinaNetHead
  • SegmentationHead

Image Classification

ResNet models trained with vanilla settings

  • Models are trained from scratch with a batch size of 4096 and an initial learning rate of 1.6.
  • Linear warmup is applied for the first 5 epochs.
  • Models are trained with L2 weight regularization and ReLU activation.

| Model | Resolution | Epochs | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 | 224x224 | 90 | 76.1 | 92.9 | config |
| ResNet-50 | 224x224 | 200 | 77.1 | 93.5 | config, ckpt |
| ResNet-101 | 224x224 | 200 | 78.3 | 94.2 | config, ckpt |
| ResNet-152 | 224x224 | 200 | 78.7 | 94.3 | config, ckpt |
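
The linear warmup described in the settings above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `warmup_lr`, the ImageNet-1k training set size of 1,281,167 images, and holding the rate constant after warmup are all assumptions, since the decay schedule after warmup is not specified in this section.

```python
import math

BATCH_SIZE = 4096              # from the settings above
INITIAL_LR = 1.6               # from the settings above
WARMUP_EPOCHS = 5              # from the settings above
NUM_TRAIN_IMAGES = 1_281_167   # assumption: ImageNet-1k training set size

# Number of optimizer steps in one pass over the training set.
STEPS_PER_EPOCH = math.ceil(NUM_TRAIN_IMAGES / BATCH_SIZE)
WARMUP_STEPS = WARMUP_EPOCHS * STEPS_PER_EPOCH


def warmup_lr(step: int) -> float:
    """Ramp the learning rate linearly from 0 to INITIAL_LR during warmup."""
    if step < WARMUP_STEPS:
        return INITIAL_LR * step / WARMUP_STEPS
    # Assumption: the rate is held constant here; a real schedule would decay it.
    return INITIAL_LR


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS):
        print(step, warmup_lr(step))
```

The initial rate of 1.6 at batch size 4096 equals 0.1 per 256 images, which matches the commonly used linear scaling rule, although the section does not state that this rule was the basis for the value.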

ResNet-RS models trained with various settings

We support state-of-the-art ResNet-RS image classification models with the following features:

  • ResNet-RS architectural changes and Swish activation. (Note that the ResNet-RS paper uses ReLU activation.)
  • Regularization methods including RandAugment, 4e-5 weight decay, stochastic depth, label smoothing, and dropout.
  • New training methods including a 350-epoch schedule, a cosine learning rate schedule, and an exponential moving average (EMA) of the weights.
  • Configs are in this directory.

| Model | Resolution | Params (M) | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- |
| ResNet-RS-50 | 160x160 | 35.7 | 79.1 | 94.5 | config, ckpt |
| ResNet-RS-101 | 160x160 | 63.7 | 80.2 | 94.9 | config, ckpt |
| ResNet-RS-101 | 192x192 | 63.7 | 81.3 | 95.6 | config, ckpt |
| ResNet-RS-152 | 192x192 | 86.8 | 81.9 | 95.8 | config, ckpt |
| ResNet-RS-152 | 224x224 | 86.8 | 82.5 | 96.1 | config, ckpt |
| ResNet-RS-152 | 256x256 | 86.8 | 83.1 | 96.3 | config, ckpt |
| ResNet-RS-200 | 256x256 | 93.4 | 83.5 | 96.6 | config, ckpt |
| ResNet-RS-270 | 256x256 | 130.1 | 83.6 | 96.6 | config, ckpt |
| ResNet-RS-350 | 256x256 | 164.3 | 83.7 | 96.7 | config, ckpt |
| ResNet-RS-350 | 320x320 | 164.3 | 84.2 | 96.9 | config, ckpt |
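
Three of the training ingredients listed above (cosine learning rate, EMA, and label smoothing) have short textbook definitions, sketched below in plain Python. The function names are hypothetical, and the EMA decay and smoothing factor are left as parameters because this section does not give their values. The code illustrates the standard formulas rather than the code path used by the configs.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def ema_update(shadow: List[float], weights: List[float], decay: float) -> List[float]:
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


def smooth_labels(one_hot: List[float], smoothing: float) -> List[float]:
    """Mix a one-hot target with the uniform distribution over classes."""
    num_classes = len(one_hot)
    return [(1.0 - smoothing) * y + smoothing / num_classes for y in one_hot]


if __name__ == "__main__":
    print(cosine_lr(step=50, total_steps=100, base_lr=1.0))
    print(ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9))
    print(smooth_labels([0.0, 1.0, 0.0, 0.0], smoothing=0.1))
```

At evaluation time, the EMA copy of the weights would typically be used in place of the raw trained weights.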

Vision Transformer (ViT)

We support ViT and DeiT implementations. The following ViT models are trained under the DeiT settings:

| Model | Resolution | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- |
| ViT-ti16 | 224x224 | 73.4 | 91.9 | ckpt |
| ViT-s16 | 224x224 | 79.4 | 94.7 | ckpt |
| ViT-b16 | 224x224 | 81.8 | 95.8 | ckpt |
| ViT-l16 | 224x224 | 82.2 | 95.8 | ckpt |

Object Detection and Instance Segmentation

Common Settings and Notes

  • We provide models with ResNet-FPN and SpineNet backbones, built on the detection frameworks covered in the sections below.
  • Models are all trained on COCO train2017 and evaluated on COCO val2017.
  • Training details:
    • Models fine-tuned from ImageNet-pretrained checkpoints adopt the 12- or 36-epoch schedule. Models trained from scratch adopt the 350-epoch schedule.
    • The default training data augmentation applies horizontal flipping and scale jittering with a random scale in the range [0.5, 2.0].
    • Unless noted, all models are trained with L2 weight regularization and ReLU activation.
    • We use a batch size of 256 and a stepwise learning rate that decays at the last 30 and the last 10 epochs.
    • We use square images as input: the long side of each image is resized to the target size, then the short side is padded with zeros.
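
The input preparation described in the training details above can be illustrated by computing only the geometry involved. This is a sketch under stated assumptions: the function names are hypothetical, padding is placed at the bottom and right (the text says only that the short side is zero-padded), and the jitter scale is drawn uniformly, which the text implies but does not state.

```python
import random
from typing import Tuple


def resize_and_pad_geometry(height: int, width: int, target_size: int) -> Tuple[int, int, int, int]:
    """Return (new_height, new_width, pad_bottom, pad_right) for a square input.

    The long side is scaled to target_size with the aspect ratio preserved;
    the remainder of the square is filled with zeros.
    """
    scale = target_size / max(height, width)
    new_height = min(target_size, max(1, round(height * scale)))
    new_width = min(target_size, max(1, round(width * scale)))
    # Assumption: zero padding goes on the bottom and right edges.
    return new_height, new_width, target_size - new_height, target_size - new_width


def sample_jitter_scale(rng: random.Random) -> float:
    """Draw a scale-jitter factor from [0.5, 2.0] (uniform is an assumption)."""
    return rng.uniform(0.5, 2.0)


if __name__ == "__main__":
    print(resize_and_pad_geometry(480, 640, 640))
    print(sample_jitter_scale(random.Random(0)))
```

For example, a 480x640 image with a 640 target keeps its size and receives 160 rows of zero padding, so the network always sees a 640x640 input.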

COCO Object Detection Baselines

RetinaNet (ImageNet pretrained)

| Backbone | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download |
| --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 640x640 | 12 | 97.0 | 34.0 | 34.3 | config |
| R50-FPN | 640x640 | 72 | 97.0 | 34.0 | 36.8 | config, ckpt |

RetinaNet (Trained from scratch)

Training features include:

  • Stochastic depth with drop rate 0.2.
  • Swish activation.

| Backbone | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download |
| --- | --- | --- | --- | --- | --- | --- |
| SpineNet-49 | 640x640 | 500 | 85.4 | 28.5 | 44.2 | config, ckpt, TB.dev |
| SpineNet-96 | 1024x1024 | 500 | 265.4 | 43.0 | 48.5 | config, ckpt, TB.dev |
| SpineNet-143 | 1280x1280 | 500 | 524.0 | 67.0 | 50.0 | config, ckpt, TB.dev |
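
The two training features named above can be sketched in plain Python. The names `swish` and `residual_block_output` are hypothetical, and the code follows the standard definitions (Swish as x * sigmoid(x); stochastic depth as randomly skipping a residual branch during training and rescaling it when kept). It is not taken from the library, and how the 0.2 rate is distributed across blocks is not specified here.

```python
import math
import random
from typing import List


def swish(x: float) -> float:
    """Swish activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return x * exp_x / (1.0 + exp_x)


def residual_block_output(shortcut: List[float], branch: List[float],
                          drop_rate: float, rng: random.Random,
                          training: bool) -> List[float]:
    """Add a residual branch to its shortcut with stochastic depth.

    In training, the branch is dropped with probability drop_rate and
    scaled by 1 / (1 - drop_rate) otherwise; in inference it is always kept.
    """
    if training:
        keep_prob = 1.0 - drop_rate
        if rng.random() >= keep_prob:
            return list(shortcut)  # branch dropped for this example
        scale = 1.0 / keep_prob
    else:
        scale = 1.0
    return [s + scale * b for s, b in zip(shortcut, branch)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(swish(1.0))
    print(residual_block_output([1.0], [2.0], drop_rate=0.2, rng=rng, training=True))
```

Rescaling the kept branch keeps its expected contribution equal between training and inference.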

Mobile-size RetinaNet (Trained from scratch)

| Backbone | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download |
| --- | --- | --- | --- | --- | --- | --- |
| MobileNetv2 | 256x256 | 600 | - | 2.27 | 23.5 | config |
| Mobile SpineNet-49 | 384x384 | 600 | 1.0 | 2.32 | 28.1 | config, ckpt |

YOLOv7 (Trained from scratch)

| Variant | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv7 | 640x640 | 300 | 53.16 | 44.57 | 50.5 | config, ckpt |

Instance Segmentation Baselines

Mask R-CNN (Trained from scratch)

| Backbone | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Mask AP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet50-FPN | 640x640 | 350 | 227.7 | 46.3 | 42.3 | 37.6 | config |
| SpineNet-49 | 640x640 | 350 | 215.7 | 40.8 | 42.6 | 37.9 | config |
| SpineNet-96 | 1024x1024 | 500 | 315.0 | 55.2 | 48.1 | 42.4 | config |
| SpineNet-143 | 1280x1280 | 500 | 498.8 | 79.2 | 49.3 | 43.4 | config |

Cascade RCNN-RS (Trained from scratch)

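| Backbone | Resolution | Epochs | Params (M) | Box AP | Mask AP | Download |
| --- | --- | --- | --- | --- | --- | --- |
| SpineNet-49 | 640x640 | 500 | 56.4 | 46.4 | 40.0 | config |
| SpineNet-96 | 1024x1024 | 500 | 70.8 | 50.9 | 43.8 | config |
| SpineNet-143 | 1280x1280 | 500 | 94.9 | 51.9 | 45.0 | config |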

Semantic Segmentation

  • We support DeepLabV3 and DeepLabV3+ architectures, with Dilated ResNet backbones.
  • Backbones are pre-trained on ImageNet.

PASCAL-VOC

| Model | Backbone | Resolution | Steps | mIoU | Download |
| --- | --- | --- | --- | --- | --- |
| DeepLabV3 | Dilated ResNet-101 | 512x512 | 30k | 78.7 | - |
| DeepLabV3+ | Dilated ResNet-101 | 512x512 | 30k | 79.2 | ckpt |

CITYSCAPES

| Model | Backbone | Resolution | Steps | mIoU | Download |
| --- | --- | --- | --- | --- | --- |
| DeepLabV3+ | Dilated ResNet-101 | 1024x2048 | 90k | 78.79 | - |

Video Classification

Common Settings and Notes

Kinetics-400 Action Recognition Baselines

| Model | Input (frame x stride) | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- |
| SlowOnly | 8 x 8 | 74.1 | 91.4 | config |
| SlowOnly | 16 x 4 | 75.6 | 92.1 | config |
| R3D-50 | 32 x 2 | 77.0 | 93.0 | config |
| R3D-RS-50 | 32 x 2 | 78.2 | 93.7 | config |
| R3D-RS-101 | 32 x 2 | 79.5 | 94.2 | - |
| R3D-RS-152 | 32 x 2 | 79.9 | 94.3 | - |
| R3D-RS-200 | 32 x 2 | 80.4 | 94.4 | - |
| R3D-RS-200 | 48 x 2 | 81.0 | - | - |
| MoViNet-A0-Base | 50 x 5 | 69.40 | 89.18 | - |
| MoViNet-A1-Base | 50 x 5 | 74.57 | 92.03 | - |
| MoViNet-A2-Base | 50 x 5 | 75.91 | 92.63 | - |
| MoViNet-A3-Base | 120 x 2 | 79.34 | 94.52 | - |
| MoViNet-A4-Base | 80 x 3 | 80.64 | 94.93 | - |
| MoViNet-A5-Base | 120 x 2 | 81.39 | 95.06 | - |
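
The "frame x stride" input notation in the table above is read here as the number of frames in a clip times the temporal stride between sampled frames, so "32 x 2" would take 32 frames spaced 2 apart and span 64 source frames. The sketch below illustrates that reading; the function name is hypothetical, and wrapping around at the end of the video is an assumption (clamping or re-sampling the start are equally plausible choices).

```python
from typing import List


def clip_frame_indices(start: int, num_frames: int, stride: int,
                       video_length: int) -> List[int]:
    """Indices of num_frames frames taken every `stride` frames from `start`.

    Assumption: indices wrap around when the clip runs past the last frame.
    """
    return [(start + i * stride) % video_length for i in range(num_frames)]


if __name__ == "__main__":
    # An "8 x 8" clip starting at frame 0 of a 300-frame video.
    print(clip_frame_indices(start=0, num_frames=8, stride=8, video_length=300))
```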

Kinetics-600 Action Recognition Baselines

| Model | Input (frame x stride) | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- |
| SlowOnly | 8 x 8 | 77.3 | 93.6 | config |
| R3D-50 | 32 x 2 | 79.5 | 94.8 | config |
| R3D-RS-200 | 32 x 2 | 83.1 | - | - |
| R3D-RS-200 | 48 x 2 | 83.8 | - | - |
| MoViNet-A0-Base | 50 x 5 | 72.05 | 90.92 | config |
| MoViNet-A1-Base | 50 x 5 | 76.69 | 93.40 | config |
| MoViNet-A2-Base | 50 x 5 | 78.62 | 94.17 | config |
| MoViNet-A3-Base | 120 x 2 | 81.79 | 95.67 | config |
| MoViNet-A4-Base | 80 x 3 | 83.48 | 96.16 | config |
| MoViNet-A5-Base | 120 x 2 | 84.27 | 96.39 | config |

More Documentation

Please read through the references in examples/starter.