December 2019
tl;dr: Annotating extreme points instead of conventional bbox corners gives the same accuracy, reduced annotation time, and additional information (the clicked points lie on the object itself).
This paper inspired ExtremeNet.
This work was extended by DEXTR (Deep Extreme Cut: From Extreme Points to Object Segmentation), which turns extreme points into instance segmentation masks.
- Conventional bbox annotation involves clicking on imaginary corners of a tight box around the object. This is difficult because these corners often lie outside the actual object, and several adjustments are needed to obtain a tight box.
- Annotation by extreme point clicking takes only 7 s per object instance, 5x faster than the traditional way of drawing a bbox.
- Extreme points are not imaginary but well-defined points on the object.
- No separate box adjustment step is required.
- Add a qualification test: find the extreme points, select all object pixels whose x or y coordinate is within 10 pixels of the corresponding extreme value, then also accept any pixel within 10 pixels of one of the selected pixels. Clicks on any of these pixels pass the test (a rough code sketch is at the end of these notes). --> we may need to adjust these thresholds for smaller objects.
- Two ways to obtain annotation: "annotation party" vs crowdsourcing. The former is too costly and crowdsourcing is essential for creating large datasets.
- The bbox annotator needs to pay attention to the extreme points anyway to ensure accurate annotation. Clicking the top-left corner couples two tasks: localizing the corner while aligning the hairlines with the top-most and left-most extreme points at the same time.
- GrabCut can be used to automatically derive masks from the extreme points (see the GrabCut sketch at the end of these notes). A CNN trained on these masks is only about 1.5 mAP below one trained with full ground-truth masks.
- This looks promising for annotating distorted and undistorted images simultaneously.
- Can we train a patch-based segmentation model to help with this task?
- Maybe we can use siam-mask to try auto annotation on videos.
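
Below is a minimal sketch of how the qualification-test region described above could be computed from a ground-truth mask. The function name, the use of numpy/scipy, and the use of Chebyshev distance (square dilation) as the "within 10 pixels" test are my assumptions, not details from the paper.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def acceptable_click_region(mask, slack=10):
    """Approximate acceptance region for one extreme-point click.

    mask:  HxW boolean array, True on object pixels (assumed to come from
           a ground-truth segmentation; assumed non-empty).
    slack: pixel tolerance; 10 px as in the notes, likely needs to be
           smaller for small objects.
    """
    ys, xs = np.nonzero(mask)
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()

    # Object pixels whose x or y coordinate is within `slack` of an
    # extreme value (left/right/top/bottom-most pixels of the object).
    near = (
        (np.abs(xs - x_min) <= slack) | (np.abs(xs - x_max) <= slack) |
        (np.abs(ys - y_min) <= slack) | (np.abs(ys - y_max) <= slack)
    )
    near_extreme = np.zeros_like(mask, dtype=bool)
    near_extreme[ys[near], xs[near]] = True

    # Also accept any pixel within `slack` of the selected pixels:
    # grow the region by a dilation of radius `slack` (Chebyshev metric).
    struct = np.ones((2 * slack + 1, 2 * slack + 1), dtype=bool)
    return binary_dilation(near_extreme, structure=struct)
```

A click (x, y) would then pass the qualification test if `acceptable_click_region(mask)[y, x]` is True.

And a sketch of deriving a mask from the four extreme points with OpenCV's cv2.grabCut. The seeding used here (box from the extreme points as probable foreground, the clicked points as definite foreground, everything outside the box as background) is my guess at a reasonable setup, not the paper's exact initialization.

```python
import numpy as np
import cv2

def mask_from_extreme_points(image, points, iters=5):
    """Rough GrabCut mask from four extreme points.

    image:  HxWx3 BGR uint8 image.
    points: four (x, y) extreme points (left-, top-, right-, bottom-most).
    Returns an HxW uint8 binary foreground mask.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)

    # Outside the box implied by the extreme points: definite background.
    # Inside the box: probable foreground.
    gc_mask = np.full(image.shape[:2], cv2.GC_BGD, dtype=np.uint8)
    gc_mask[y0:y1 + 1, x0:x1 + 1] = cv2.GC_PR_FGD
    # The clicked extreme points lie on the object: definite foreground.
    for x, y in points:
        gc_mask[y, x] = cv2.GC_FGD

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, gc_mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)

    # Keep definite + probable foreground as the final binary mask.
    return np.isin(gc_mask, [cv2.GC_FGD, cv2.GC_PR_FGD]).astype(np.uint8)
```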