Auto annotation

Description

The application is enabled automatically if the OpenVINO™ component is installed. It allows you to use custom models for auto annotation. Only models in the OpenVINO™ toolkit format are supported. If you would like to annotate a task with a custom model, first convert it to the intermediate representation (IR) format via the Model Optimizer tool. See the OpenVINO documentation for details.
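
As a rough illustration, a conversion command might look like the one below; the Model Optimizer path, the input model file, and the output directory are placeholders that depend on your OpenVINO installation and the source framework:

$ python /opt/intel/openvino/deployment_tools/model_optimizer/mo.py \
    --input_model /path/to/your_model.pb \
    --output_dir /path/to/ir_output

The resulting *.xml and *.bin files are the model config and model weights described in the Usage section below.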

Installation

See the installation instructions for the OpenVINO component.

Usage

To annotate a task with a custom model, you need to prepare four files:

  1. Model config (*.xml) - a text file with network configuration.

  2. Model weights (*.bin) - a binary file with trained weights.

  3. Label map (*.json) - a simple JSON file with a label_map dictionary-like object that maps label numbers (string keys) to label names. Example:

    {
      "label_map": {
        "0": "background",
        "1": "aeroplane",
        "2": "bicycle",
        "3": "bird",
        "4": "boat",
        "5": "bottle",
        "6": "bus",
        "7": "car",
        "8": "cat",
        "9": "chair",
        "10": "cow",
        "11": "diningtable",
        "12": "dog",
        "13": "horse",
        "14": "motorbike",
        "15": "person",
        "16": "pottedplant",
        "17": "sheep",
        "18": "sofa",
        "19": "train",
        "20": "tvmonitor"
      }
    }
  4. Interpretation script (*.py) - a file used to convert the network output layer to a predefined structure which can be processed by CVAT. This code runs inside a restricted Python environment, but it is possible to use some built-in functions like str, int, float, max, min, range.

    Two variables are also available in the scope:

    • detections - a list of dictionaries with detections for each frame:

      • frame_id - frame number
      • frame_height - frame height
      • frame_width - frame width
      • detections - output np.ndarray (See ExecutableNetwork.infer for details).
    • results - an instance of a Python class that accumulates the converted results. The following methods should be used to add shapes (a minimal skeleton script is sketched after this list):

      # xtl, ytl, xbr, ybr - expected values are float or int
      # label - expected value is int
      # frame_number - expected value is int
      # attributes - dictionary of attribute_name: attribute_value pairs, for example {"confidence": "0.83"}
      add_box(self, xtl, ytl, xbr, ybr, label, frame_number, attributes=None)
      
      # points - list of (x, y) pairs of float or int, for example [(57.3, 100), (67, 102.7)]
      # label - expected value is int
      # frame_number - expected value is int
      # attributes - dictionary of attribute_name: attribute_value pairs, for example {"confidence": "0.83"}
      add_points(self, points, label, frame_number, attributes=None)
      add_polygon(self, points, label, frame_number, attributes=None)
      add_polyline(self, points, label, frame_number, attributes=None)
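
A minimal skeleton of an interpretation script is sketched below. It only illustrates the overall structure: the placeholder box coordinates, label, and attribute values are not tied to any real model, and how you index into frame_results["detections"] depends entirely on your network's output layout. Complete, model-specific examples follow in the Examples section.

# Minimal structural sketch of an interpretation script.
# `detections` and `results` are provided by CVAT; the box values below are placeholders.
for frame_results in detections:
  frame_number = frame_results["frame_id"]
  frame_height = frame_results["frame_height"]
  frame_width = frame_results["frame_width"]

  # ... interpret frame_results["detections"] for this frame here ...

  # Example call with placeholder values:
  results.add_box(
    xtl=0.0, ytl=0.0, xbr=10.0, ybr=10.0,
    label=1, frame_number=frame_number,
    attributes={"confidence": "1.00"},
  )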

Testing script

CVAT comes prepackaged with a small command-line helper script for developing interpretation scripts.

It includes a small user interface which allows users to feed in images and see the results rendered in OpenCV windows.

See the script and the documentation in the auto_annotation directory.

When using the Auto Annotation runner, it is often helpful to drop into a REPL prompt to interact with the variables directly. You can do this using the interact function from the built-in code module.

# Import the interact function from the `code` module
from code import interact


for frame_results in detections:
  frame_height = frame_results["frame_height"]
  frame_width = frame_results["frame_width"]
  frame_number = frame_results["frame_id"]
  # Unsure what other data members are in `frame_results`? Use `interact`!
  interact(local=locals())
$ python cvat/utils/auto_annotation/run_models.py --py /path/to/myfile.py --json /path/to/mapping.json --xml /path/to/inference.xml --bin /path/to/inference.bin
Python 3.6.6 (default, Sep 26 2018, 15:10:10)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> dir()
['__builtins__', 'frame_results', 'detections', 'frame_number', 'frame_height', 'interact', 'results', 'frame_width']
>>> type(frame_results)
<class 'dict'>
>>> frame_results.keys()
dict_keys(['frame_id', 'frame_height', 'frame_width', 'detections'])

When using interact, make sure you are running under the testing script, and remove the call before submitting your script to the server! If you don't remove it, the code runners will hang during execution, and you'll have to restart the server to fix them.

Another useful development technique is visualizing the results using OpenCV, which is discussed further in the Semantic Segmentation section below.

Examples

Links

Task labels: person vehicle non-vehicle

label_map.json:

{
  "label_map": {
    "1": "person",
    "2": "vehicle",
    "3": "non-vehicle"
  }
}

Interpretation script for SSD-based networks:

def clip(value):
  return max(min(1.0, value), 0.0)

for frame_results in detections:
  frame_height = frame_results["frame_height"]
  frame_width = frame_results["frame_width"]
  frame_number = frame_results["frame_id"]

  for i in range(frame_results["detections"].shape[2]):
    confidence = frame_results["detections"][0, 0, i, 2]
    if confidence < 0.5:
      continue

    results.add_box(
      xtl=clip(frame_results["detections"][0, 0, i, 3]) * frame_width,
      ytl=clip(frame_results["detections"][0, 0, i, 4]) * frame_height,
      xbr=clip(frame_results["detections"][0, 0, i, 5]) * frame_width,
      ybr=clip(frame_results["detections"][0, 0, i, 6]) * frame_height,
      label=int(frame_results["detections"][0, 0, i, 1]),
      frame_number=frame_number,
      attributes={
        "confidence": "{:.2f}".format(confidence),
      },
    )

Links

Task labels: left_eye right_eye tip_of_nose left_lip_corner right_lip_corner

label_map.json:

{
  "label_map": {
      "0": "left_eye",
      "1": "right_eye",
      "2": "tip_of_nose",
      "3": "left_lip_corner",
      "4": "right_lip_corner"
  }
}

Interpretation script:

def clip(value):
  return max(min(1.0, value), 0.0)

for frame_results in detections:
  frame_height = frame_results["frame_height"]
  frame_width = frame_results["frame_width"]
  frame_number = frame_results["frame_id"]

  for i in range(0, frame_results["detections"].shape[1], 2):
    x = frame_results["detections"][0, i, 0, 0]
    y = frame_results["detections"][0, i + 1, 0, 0]

    results.add_points(
      points=[(clip(x) * frame_width, clip(y) * frame_height)],
      label=i // 2,  # see the label map and the model output specification
      frame_number=frame_number,
    )

Semantic Segmentation

Links

label_map.json:

{
  "label_map": {
    "1": "person",
    "2": "bicycle",
    "3": "car"
  }
}

Note that the above labels are not all the labels in the model! See here.

Interpretation script for a semantic segmentation network:

import numpy as np
import cv2
from skimage.measure import approximate_polygon, find_contours


for frame_results in detections:
    frame_height = frame_results['frame_height']
    frame_width = frame_results['frame_width']
    frame_number = frame_results['frame_id']
    detection = frame_results['detections']

    # The keys for the two members below will vary based on the model
    masks = frame_results['masks']
    boxes = frame_results['reshape_do_2d']

    for box_index, box in enumerate(boxes):
        # Again, these indices are specific to this model
        class_label = int(box[1])
        box_class_probability = box[2]

        if box_class_probability > 0.2:
            xmin = box[3] * frame_width
            ymin = box[4] * frame_height
            xmax = box[5] * frame_width
            ymax = box[6] * frame_height

            box_width = int(xmax - xmin)
            box_height = int(ymax - ymin)

            # use the box index and class label index to find the appropriate mask
            # note that the class label has to be converted to a zero-based index by subtracting 1
            class_mask = masks[box_index][class_label - 1]

            # The class mask is a 33 x 33 matrix;
            # resize it to the bounding box (cv2.resize expects dsize as (width, height))
            resized_mask = cv2.resize(class_mask, dsize=(box_width, box_height), interpolation=cv2.INTER_CUBIC)

            # Each pixel is a probability; select every pixel above the probability threshold, 0.5
            # Do this using the boolean `>` operator
            boolean_mask = (resized_mask > 0.5)

            # Convert the boolean values to uint8
            uint8_mask = boolean_mask.astype(np.uint8) * 255

            # Change the x and y coordinates into integers
            xmin = int(round(xmin))
            ymin = int(round(ymin))
            xmax = xmin + box_width
            ymax = ymin + box_height

            # Create an empty frame, so that we can get the mask polygon in frame coordinates
            mask_frame = np.zeros((frame_height, frame_width), dtype=np.uint8)

            # Put the uint8_mask on the mask frame using the integer coordinates
            # (numpy indexing is [row, column], i.e. [y, x])
            mask_frame[ymin:ymax, xmin:xmax] = uint8_mask

            mask_probability_threshold = 0.5
            # find the contours
            contours = find_contours(mask_frame, mask_probability_threshold)
            # every bounding box should only have a single contour
            contour = contours[0]
            contour = np.flip(contour, axis=1)

            # reduce the precision of the polygon
            polygon_mask = approximate_polygon(contour, tolerance=2.5)
            polygon_mask = polygon_mask.tolist()

            results.add_polygon(polygon_mask, class_label, frame_number)

Note that it is sometimes hard to see or understand what is happening in a script. Using OpenCV to visualize intermediate results can help.

import cv2
import numpy as np


for frame_results in detections:
    frame_height = frame_results['frame_height']
    frame_width = frame_results['frame_width']
    detection = frame_results['detections']

    masks = frame_results['masks']
    boxes = frame_results['reshape_do_2d']

    for box_index, box in enumerate(boxes):
        class_label = int(box[1])
        box_class_probability = box[2]

        if box_class_probability > 0.2:
            xmin = box[3] * frame_width
            ymin = box[4] * frame_height
            xmax = box[5] * frame_width
            ymax = box[6] * frame_height

            box_width = int(xmax - xmin)
            box_height = int(ymax - ymin)

            class_mask = masks[box_index][class_label - 1]
            # Visualize the class mask!
            cv2.imshow('class mask', class_mask)
            # wait until the user presses a key
            cv2.waitKey(0)

            resized_mask = cv2.resize(class_mask, dsize=(box_width, box_height), interpolation=cv2.INTER_CUBIC)
            boolean_mask = (resized_mask > 0.5)
            uint8_mask = boolean_mask.astype(np.uint8) * 255

            # Visualize the thresholded mask after it's been resized!
            cv2.imshow('resized class mask', uint8_mask)
            cv2.waitKey(0)

Note that you should only use the above commands while running the Auto Annotation model runner; running them on the server will likely require a server restart to recover. Calling cv2.destroyAllWindows() or cv2.destroyWindow('your-name-here') might be required depending on your implementation.
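
For example, a debugging block that cleans up its windows might look like the sketch below (the window name and the stand-in mask are purely illustrative):

import cv2
import numpy as np

# Illustrative stand-in for the mask you want to inspect
debug_mask = np.zeros((100, 100), dtype=np.uint8)

# Show the mask, wait for a key press, then close the window again so that
# stray debug windows do not accumulate while you iterate on the script
cv2.imshow('debug mask', debug_mask)
cv2.waitKey(0)
cv2.destroyWindow('debug mask')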

Available interpretation scripts

CVAT comes prepackaged with several out-of-the-box interpretation scripts. See them in the Open Model Zoo directory.