Pull request #16: Release/0.2.0
Merge in FO00039/thetis-public-github from release/0.2.0 to main
Kueppers, Fabian, Dr. (EFS-GKE09) authored and committed on Mar 20, 2024
1 parent 0f5b4cb commit a093ff4
Showing 22 changed files with 903 additions and 260 deletions.
README.md: 20 changes (10 additions, 10 deletions)
@@ -120,8 +120,8 @@ result = thetis(

Thetis expects two Pandas data frames to run an evaluation:

-* *Annotations*: `pd.DataFrame` with ground-truth information about the data set. The column `target` is required, holding
-  the ground-truth target information. Furthermore, columns for sensitive attributes are expected that have been
+* *Annotations*: `pd.DataFrame` with ground truth information about the data set. The column `target` is required, holding
+  the ground truth target information. Furthermore, columns for sensitive attributes are expected that have been
configured for the AI Fairness evaluation.
* *Predictions*: `pd.DataFrame` with the AI predictions for each sample in the data set. The columns `labels` and
`confidence` are required, holding information about the predicted label and the respective prediction
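For illustration, here is a minimal, hypothetical sketch of the two frames in classification mode. The `target`, `labels`, and `confidence` columns are the ones named above; the `gender` attribute and all values are invented for this example:

```python
import pandas as pd

# Hypothetical annotations: ground truth targets plus a sensitive attribute
# ("gender") configured for the AI Fairness evaluation.
annotations = pd.DataFrame({
    "target": ["person", "no person", "person"],
    "gender": ["female", "male", "female"],
})

# Hypothetical predictions: predicted label and the confidence of that prediction.
predictions = pd.DataFrame({
    "labels": ["person", "no person", "no person"],
    "confidence": [0.92, 0.87, 0.55],
})
```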
@@ -216,16 +216,16 @@ for filename in tqdm(annotation_files, desc="Running inference on images ..."):

#### Expected Data Format for Object Detection

-After loading the ground-truth information and running inference using an AI model (see example above),
+After loading the ground truth information and running inference using an AI model (see example above),
we must format our predictions and annotations in a way that can be ingested by Thetis. In object detection evaluation mode,
Thetis expects a Python dictionary for the predictions and annotations, where the keys represent the image identifiers
-(e.g., image name) and the values represent the individual (predicted or ground-truth) objects within a single frame.
+(e.g., image name) and the values represent the individual (predicted or ground truth) objects within a single frame.

```python
import pandas as pd

# Thetis expects a dictionary with image name as key and a pd.DataFrame with predicted information as value.
-# A similar format is also expected for the ground-truth annotations with extra sensitive attributes
+# A similar format is also expected for the ground truth annotations with extra sensitive attributes
# used for fairness evaluation. The field "__meta__" is always required with meta information for each frame.
annotations = {"__meta__": pd.DataFrame(columns=["width", "height"])}
predictions = {}
@@ -250,7 +250,7 @@ for pred, anns in data:
"ymax": predicted_boxes[:, 3][filter],
})

-# add ground-truth information also as pd.DataFrame with additional sensitive attributes
+# add ground truth information also as pd.DataFrame with additional sensitive attributes
annotations[filename] = pd.DataFrame.from_dict({
"target": anns["classes"],
"gender": anns["gender"],
@@ -265,13 +265,13 @@ for pred, anns in data:
annotations["__meta__"].loc[filename] = [anns["image_width"], anns["image_height"]]
```

-*Important*: The dictionary for the ground-truth annotations requires a key `__meta__` which holds width and height
+*Important*: The dictionary for the ground truth annotations requires a key `__meta__` which holds width and height
information for each image within the data set, provided as Pandas DataFrame. Note that the index of the entries within
this DataFrame must match with the keys (i.e. image identifiers) of the Python dictionaries.
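As a minimal sketch (image names are hypothetical), a valid `__meta__` frame can be built like this:

```python
import pandas as pd

# The "__meta__" frame holds width and height per image. Its index must match
# the image identifiers used as keys of the predictions/annotations dictionaries.
meta = pd.DataFrame(columns=["width", "height"])
meta.loc["image_000.png"] = [1920, 1080]
meta.loc["image_001.png"] = [1280, 720]

image_keys = ["image_000.png", "image_001.png"]  # dictionary keys used for the frames
assert list(meta.index) == image_keys
```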

#### Running AI Safety Evaluation with Thetis

-Given your data is in the right format, simply call Thetis with the predictions, the ground-truth information and the
+Given your data is in the right format, simply call Thetis with the predictions, the ground truth information and the
prepared configuration file:

```python
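# NOTE: the body of this code block is collapsed in the diff view above. The
# call below is a hedged sketch rather than the verbatim README code: the
# `thetis` entry point is confirmed by the "result = thetis(" hunk context at
# the top of this diff, but the argument names here are assumptions.
from thetis import thetis

result = thetis(
    config="demo_config.yaml",   # prepared YAML configuration file (assumed name)
    annotations=annotations,     # ground truth information prepared above
    predictions=predictions,     # AI predictions prepared above
    output_dir="./output",       # output directory for the report (assumed)
)
```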
@@ -300,6 +300,6 @@ We capture the dictionary as `result` and can access the different evaluation as
or `'BAD'` depending on the rating score.

Note that the remaining evaluation metrics are grouped by the specified IoU scores which are used for the matching
-of predicted objects with ground-truth ones (e.g., an IoU score of 0.5 might be used to decide if a prediction
-has matched an existing ground-truth object or not). In the configuration file, you can specify multiple IoU scores
+of predicted objects with ground truth ones (e.g., an IoU score of 0.5 might be used to decide if a prediction
+has matched an existing ground truth object or not). In the configuration file, you can specify multiple IoU scores
that are taken into account for the final evaluation process.
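Since the full result layout is collapsed in this diff, the following sketch simply walks the nested dictionary to list the reported aspects per IoU level (structure assumed from the description above):

```python
# Illustrative only: print the nested keys of the evaluation result.
def walk(entry, indent=0):
    if isinstance(entry, dict):
        for key, value in entry.items():
            print(" " * indent + str(key))
            walk(value, indent + 2)

walk(result)
```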
docs/source/bibliography.bib: 8 changes (8 additions, 0 deletions)
@@ -9,6 +9,14 @@ @inproceedings{Agarwal2018
organization={PMLR}
}

+@InProceedings{Steinberg2020,
+  author    = {Steinberg, Daniel and Reid, Alistair and O'Callaghan, Simon},
+  title     = {Fairness Measures for Regression via Probabilistic Classification},
+  booktitle = {2nd Ethics of Data Science Conference},
+  year      = {2020},
+  month     = mar,
+}

@InProceedings{Guo2018,
author = {Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger},
title = {{On Calibration of Modern Neural Networks}},
docs/source/configuration.rst: 93 changes (58 additions, 35 deletions)
@@ -27,70 +27,79 @@ An exemplary YAML configuration for Thetis must have the following form:
revision: "r1"
-# examination task. Can be one of: "classification" (binary/multi-class classification),
-# "detection" (image-based object detection)
+# Examination task. Can be one of: "classification" (binary/multi-class classification),
+# "detection" (image-based object detection) or "regression"
task: "classification"
-# language of the final report. Can be one of: "en", "de"
+# Language of the final report. Can be one of: "en", "de"
language: "en"
-# list of distinct classes that can occur within the data set
-distinct_classes: ["no person", "person"]
+# Task-specific settings. Required and available fields depend on the selected task.
+task_settings:
-# in binary classification (when 'distinct_classes' has length of 2), you must specify a positive label out of
-# the list of available classes. This is important since you only give a single "confidence" for each prediction,
-# targeting the probability of the positive class
-binary_positive_label: "person"
+# List of distinct classes that can occur within the data set (can only be set for classification or
+# object detection). If specified then this parameter cannot be empty.
+distinct_classes: ["no person", "person"]
-# you can specify some general settings here (atm only detection-specific settings)
-task_settings:
+# In binary classification (when 'distinct_classes' has length of 2), you must specify a positive label out of
+# the list of available classes. This is important since you only give a single "confidence" for each prediction,
+# targeting the probability of the positive class. May only be specified for binary classification.
+binary_positive_label: "person"
-# bounding-box format. Can be one of: "xyxy" (xmin, ymin, xmax, ymax), "xywh" (xmin, ymin, width, height),
+# Bounding-box format. Can be one of: "xyxy" (xmin, ymin, xmax, ymax), "xywh" (xmin, ymin, width, height),
# or "cxcywh" (center x, center y, width, height).
detection_bbox_format: "xyxy"
-# list with IoU scores used for object detection evaluation
+# List with IoU scores used for object detection evaluation
# Note: the IoU score "0.5" is always active for the evaluation. You can specify more IoU scores if you want
detection_bbox_ious: [0.75]
-# set to true if the bounding boxes are also inferred with a separate variance score (currently not supported)
+# String with bounding box matching strategy. Must be one of: "exclusive", "max".
+detection_bbox_matching: "exclusive"
+# Set to true if the bounding boxes are also inferred with a separate variance score (currently not supported)
detection_bbox_probabilistic: false
-# in detection mode, it is possible to set a confidence threshold
+# In detection mode, it is possible to set a confidence threshold
# to discard blurry predictions with low confidence
detection_confidence_thr: 0.2
-# settings for the data evaluation routine
+# In detection mode it is possible to specify a tolerance zone outside the image bounds within which clipping
+# is applied. Boxes within this zone are clipped to the image dimensions. For boxes outside the specified
+# tolerance, an error is raised instead.
+detection_bbox_clipping: 20%
+# Settings for the data evaluation routine
data_evaluation:
examine: true
-# settings for the AI baseline performance evaluation (which should be always performed!)
+# Settings for the AI baseline performance evaluation (which should always be performed!)
performance:
examine: true
-# settings for the evaluation of confidence calibration
+# Settings for the evaluation of confidence calibration
uncertainty:
examine: true
-# number of bins used for ECE calculation, required for classification and detection evaluation
+# Number of bins used for ECE calculation, required for classification and detection evaluation
ece_bins : 20
-# during ECE/D-ECE computation, bins with a number of samples less than this threshold are ignored
-# required for classification and detection evaluation
+# During ECE/D-ECE computation, bins with a number of samples less than this threshold are ignored
+# Required for classification and detection evaluation
ece_sample_threshold: 10
-# number of bins used for D-ECE calculation (object detection), required for detection evaluation
+# Number of bins used for D-ECE calculation (object detection), required for detection evaluation
dece_bins: 5
-# settings for the evaluation of model fairness
+# Settings for the evaluation of model fairness
fairness:
examine: true
-# specify sensitive attributes that are used for fairness evaluation. For each of these attributes,
+# Specify sensitive attributes that are used for fairness evaluation. For each of these attributes,
# you need to specify the classes for which the attributes are actually valid (out of the labels
-# within 'distinct_classes' list). You can also type "all" to mark validity for all classes.
-gender: ["no person", "person"]
-age: "all"
+# within 'distinct_classes' list). You can also leave it empty or type "all" to mark validity for all classes.
+sensitive_attributes:
+  gender: ["no person", "person"]
+  age: "all"
General Application Settings
@@ -113,10 +122,10 @@ In the following, we give a detailed overview about all possible general configu
- Revision of the AI model used to generate predictions.
* - :code:`meta/dataset/name`
- string
-- Name of the data set holding the ground-truth information.
+- Name of the data set holding the ground truth information.
* - :code:`meta/dataset/revision`
- string
-- Revision of the data set holding the ground-truth information.
+- Revision of the data set holding the ground truth information.


.. list-table:: General application settings
@@ -132,10 +141,10 @@ In the following, we give a detailed overview about all possible general configu
* - :code:`language`
- string
- Language of the final evaluation report. Can be one of: "en" (US English), "de" (German).
-* - :code:`distinct_classes`
+* - :code:`task_settings/distinct_classes`
- list of int or string
-- List of distinct classes that can occur within the data set.
-* - :code:`binary_positive_label`
+- List of distinct classes that can occur within the data set. Only to be provided for classification or detection tasks.
+* - :code:`task_settings/binary_positive_label`
- int or string
- In binary classification (when 'distinct_classes' has length of 2), you must specify a positive label out of
the list of available classes. This is important since you only give a single "confidence" for each prediction,
@@ -148,12 +157,26 @@ In the following, we give a detailed overview about all possible general configu
- list of float
- List with IoU scores (in [0, 1] interval) used for object detection evaluation.
Note: the IoU score "0.5" is always active for the evaluation. You can specify more IoU scores if you want.
+* - :code:`task_settings/detection_bbox_matching`
+- string
+- Bounding box matching strategy used during object detection evaluation. Must be one of: "exclusive" or "max".
+With "exclusive" matching, each prediction and each ground truth object are assigned to at most a single
+counterpart; with "max" (maximum/non-exclusive) matching, each ground truth object may have multiple
+predictions assigned to it. Default is "exclusive" (see the illustrative sketch after this table).
* - :code:`task_settings/detection_bbox_probabilistic`
- boolean
- Currently not used.
* - :code:`task_settings/detection_confidence_thr`
- float
- In detection mode, it is possible to set a confidence threshold (in [0, 1] interval) to discard blurry predictions with low confidence.
+* - :code:`task_settings/detection_bbox_clipping`
+- int or string
+- In detection mode, it is possible to specify a tolerance zone outside the image bounds for boxes that extend
+beyond the image. If omitted, no clipping is applied and an error is raised whenever a box lies outside the
+image bounds. The value can be given relative to the image width and height ("[0-100]%") or as an absolute
+pixel value ("[int]px"). Boxes extending into this tolerance zone are clipped to the image dimensions; boxes
+exceeding the tolerance zone are not clipped, and an error is raised instead.
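To make the difference between the two matching strategies concrete, here is a small illustrative sketch; it is not Thetis's internal matching code, and the "xyxy" box format plus a single IoU threshold are assumed:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two "xyxy" boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match(preds, gts, iou_thr=0.5, strategy="exclusive"):
    """Assign predictions to ground truth boxes.

    "exclusive": each ground truth collects at most one prediction.
    "max": every prediction above the threshold is assigned, so one
    ground truth may collect several predictions.
    """
    assigned = {}   # prediction index -> ground truth index
    taken = set()   # ground truths already matched (exclusive mode)
    for p_idx, pred in enumerate(preds):
        ious = [iou(pred, gt) for gt in gts]
        if not ious:
            continue
        best = int(np.argmax(ious))
        if ious[best] < iou_thr:
            continue  # unmatched prediction (counted as false positive)
        if strategy == "exclusive" and best in taken:
            continue  # this ground truth already has a match
        assigned[p_idx] = best
        taken.add(best)
    return assigned

# Example: two predictions overlap the same ground truth box.
gts = [(0, 0, 10, 10)]
preds = [(1, 1, 10, 10), (0, 0, 9, 9)]
print(match(preds, gts, strategy="exclusive"))  # {0: 0}
print(match(preds, gts, strategy="max"))        # {0: 0, 1: 0}
```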

Configuration of Safety Evaluation
----------------------------------
@@ -215,10 +238,10 @@ Configuration of Safety Evaluation
* - :code:`fairness/examine`
- boolean
- Enables/disables the AI fairness evaluation for the final rating & reporting.
-* - :code:`fairness/<label name>`
-- string or list of int/string
+* - :code:`fairness/sensitive_attributes/<label name>`
+- optional string or list of int/string
- Specify one or multiple sensitive attributes (e.g., gender or age) that are used for fairness evaluation.
The value of this entry is a list of target classes (given by "distinct_classes" parameter) for which the
sensitive attribute is valid. For example, if "distinct_classes" specifies labels "person" and "car", a
sensitive attribute for "gender" might only be valid for target label "person". If the attribute is valid for
-all specified target labels, you can also pass the value "all".
+all specified target labels, you can also leave the value empty or pass "all".
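As an illustrative check (not part of Thetis), the relationship between `distinct_classes` and the sensitive attribute entries can be validated like this; all names and values below are hypothetical:

```python
import pandas as pd

distinct_classes = ["no person", "person"]
sensitive_attributes = {"gender": ["no person", "person"], "age": "all"}

annotations = pd.DataFrame({
    "target": ["person", "no person"],
    "gender": ["female", "male"],
    "age": ["adult", "senior"],
})

for attribute, valid_for in sensitive_attributes.items():
    # Each configured attribute needs a matching annotation column ...
    assert attribute in annotations.columns, f"missing column: {attribute}"
    # ... and its validity list may only reference known target classes.
    if valid_for and valid_for != "all":
        unknown = set(valid_for) - set(distinct_classes)
        assert not unknown, f"unknown classes for {attribute}: {unknown}"
```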