chg ! deduplicationset config
vitali-yanushchyk-valor committed Jan 21, 2025
1 parent 2cba530 commit 7b3eff1
Showing 14 changed files with 291 additions and 298 deletions.
79 changes: 30 additions & 49 deletions docs/src/did/config.md
@@ -1,68 +1,49 @@
The configuration can be managed directly through the **admin panel**, which provides a simple way to modify settings without changing the codebase. Navigate to:
### Default Configuration

Home › Constance › Config
The default configuration can be managed directly through the **admin panel**, providing a simple way to modify settings without changing the codebase. Navigate to:

Here, you will find all the configurable settings that affect the behavior of the system, allowing for quick adjustments and better control over application behavior.
Home › Constance › Config

## Deep neural networks (DNN)
This section contains the **default configuration**, which applies to the system globally. It defines the base settings for the application and ensures consistent behavior across all deduplication sets.

The deep learning component of the system is crucial for performing advanced inference tasks, including **face detection**, **face recognition**, and **finding duplicate images** using a pre-trained model. These tasks are fundamental to ensuring the accuracy and efficiency of the system in identifying and managing images.
#### Key Parameters:

This component relies on **Convolutional Neural Networks (CNNs)**, a type of deep learning model particularly well-suited for processing visual data. CNNs are used to automatically extract relevant features from images, such as facial landmarks and distinctive patterns, without the need for manual feature engineering.
- **MODEL_NAME**
Specifies the face recognition model to be used for **encoding** face landmarks.

### DNN_BACKEND
- **DETECTOR_BACKEND**
Specifies the face detector backend to be used for **detecting** faces in images.

Specifies the computation backend to be used by [OpenCV](https://github.com/opencv/opencv) library for deep learning inference.
- **FACE_DISTANCE_THRESHOLD**
Specifies the maximum allowable **distance** between two face embeddings for them to be considered a match. This **tolerance threshold** is crucial for assessing whether two faces belong to the same individual. Lower values result in stricter matching, while higher values allow for more lenient matches.

### DNN_TARGET
---

Specifies the target device on which [OpenCV](https://github.com/opencv/opencv) library will perform the deep learning computations.
### Custom Configuration

Custom configurations can be created for specific **deduplication sets** through the **admin panel** by navigating to:

## Face Detection
Home › API › Configs

This component is responsible for locating and identifying faces in images. It uses advanced deep learning algorithms to scan images and detect the regions that contain human faces. This section outlines the key configuration parameters that influence how the face detection model processes input images and optimizes detection results.
At this path, you can define a custom configuration in **JSON format**. Custom configurations include only the parameters you wish to override from the default configuration.

### BLOB_FROM_IMAGE_SCALE_FACTOR
#### Details:
- Parameters not included in the custom configuration will default to the values from the global configuration and will be used by the system in subsequent calculations. This ensures that all required parameters are available, even if they are not explicitly defined in the custom configuration.
- A custom configuration can be associated with a specific deduplication set via:

Specifies the scaling factor applied to all pixel values when converting an image to a blob. It is typically 1.0 (no scaling) or 1.0/255.0 (normalizing to the [0, 1] range).
Home › API › Deduplication sets › specific_deduplication_set

Remember that scaling factor is also applied to mean values. Both scaling factor and mean values must be the same for the training and inference to get the correct results.
- If no custom configuration is associated with a deduplication set, the system will entirely use the **default configuration** for that set.

### BLOB_FROM_IMAGE_MEAN_VALUES
- Parameter names in custom configurations are case-sensitive and must be in lowercase.

Specifies the mean BGR values used in image preprocessing to normalize pixel values by subtracting the mean values of the training dataset. This helps in reducing model bias and improving accuracy.
#### Example Custom Configuration:
To override only the **FACE_DISTANCE_THRESHOLD** parameter, your custom configuration can look like this:

The specified mean values are subtracted from each channel (Blue, Green, Red) of the input image.

Remember that mean values are also applied to scaling factor. Both scaling factor and mean values must be the same for the training and inference to get the correct results.

### FACE_DETECTION_CONFIDENCE

Specifies the minimum confidence score required for a detected face to be considered valid. Detections with confidence scores below this threshold are discarded as likely false positives.

### NMS_THRESHOLD

Specifies the Intersection over Union (IoU) threshold used in Non-Maximum Suppression (NMS) to filter out overlapping bounding boxes. If the IoU between two boxes exceeds this threshold, the box with the lower confidence score is suppressed. Lower values result in fewer, more distinct boxes; higher values allow more overlapping boxes to remain.

## Face Recognition

This component builds on face detection to identify and differentiate between individual faces. This involves generating face encodings, which are numerical representations of the unique facial features used for recognition. These encodings can then be compared to determine if two images contain the same person or to find matches in a database of known faces.

### FACE_ENCODINGS_NUM_JITTERS

Specifies the number of times to re-sample the face when calculating the encoding. Higher values increase accuracy but are computationally more expensive and slower. For example, setting 'num_jitters' to 100 makes the process 100 times slower.

### FACE_ENCODINGS_MODEL

Specifies the model type used for encoding face landmarks. It can be either 'small', which is faster and uses only 5 key facial landmarks, or 'large', which is more precise and identifies 68 key facial landmarks but requires more computational resources.


## Duplicate Finder

This component is responsible for identifying duplicate images in the system by comparing face embeddings. These embeddings are numerical representations of facial features generated during the face recognition process. By calculating the distance between the embeddings of different images, the system can determine whether two images contain the same person, helping in the identification and removal of duplicates or grouping similar faces together.

### FACE_DISTANCE_THRESHOLD

Specifies the maximum allowable distance between two face embeddings for them to be considered a match. It helps determine if two faces belong to the same person by setting a threshold for similarity. Lower values result in stricter matching, while higher values allow for more lenient matches.
```json
{
"FACE_DISTANCE_THRESHOLD": 0.7
}
```

This flexibility allows you to define tailored configurations for specific deduplication sets, ensuring precise control over application behavior while preserving the stability of default settings.
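
The fallback rule described in the Details list can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: `DEFAULT_SETTINGS`, its values, and `effective_settings` are hypothetical names, and the lowercase key follows the case-sensitivity note above.

```python
import json
from typing import Optional

# Hypothetical defaults standing in for the global configuration.
# The values are illustrative, not the project's real defaults.
DEFAULT_SETTINGS = {
    "model_name": "example-model",
    "detector_backend": "example-detector",
    "face_distance_threshold": 0.4,
}


def effective_settings(custom_json: Optional[str]) -> dict:
    """Return the defaults with any custom overrides applied on top."""
    overrides = json.loads(custom_json) if custom_json else {}
    # Keys present in the custom configuration win; all others fall back.
    return {**DEFAULT_SETTINGS, **overrides}


print(effective_settings('{"face_distance_threshold": 0.7}'))
print(effective_settings(None))
```

Only the overridden key changes; a deduplication set with no custom configuration receives the defaults unchanged.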
68 changes: 32 additions & 36 deletions docs/src/setup.md
@@ -5,7 +5,7 @@ tags:

## Prerequisites

This project utilizes [PDM](https://pdm-project.org/) as the package manager for managing Python dependencies and environments.
This project utilizes [UV](https://docs.astral.sh/uv/) as the package manager for managing Python dependencies and environments.

To successfully set up and run this project, ensure that you have the following components in place:

@@ -23,76 +23,72 @@ Essential steps for verifying and configuring the environment settings required

### Display the Current Configuration

$ docker run -it -t unicef/hope-dedupe-engine:release-0.1 django-admin env
$ docker run -it -t unicef/hope-dedupe-engine:release-0.2 django-admin env

### Mandatory Environment Variables
Check Environment Variables

$ docker run -it -t unicef/hope-dedupe-engine:release-0.1 django-admin env --check
$ docker run -it -t unicef/hope-dedupe-engine:release-0.2 django-admin env --check

Ensure the following environment variables are properly configured:

DATABASE_URL
SECRET_KEY

ADMIN_EMAIL
ADMIN_PASSWORD
ALLOWED_HOSTS
CACHE_URL
CELERY_BROKER_URL
MEDIA_ROOT
STATIC_ROOT
DEFAULT_ROOT
FILE_STORAGE_DNN
CSRF_COOKIE_SECURE
DATABASE_URL
DEEPFACE_HOME
FILE_STORAGE_HOPE
FILE_STORAGE_STATIC
FILE_STORAGE_MEDIA
MEDIA_ROOT
SECRET_KEY
STATIC_ROOT

### Variables Breakdown

Detailed information about the required environment variables is provided for clarity and proper configuration.

#### Operational

##### DATABASE_URL
The URL for the database connection. *Example:* `postgres://hde:password@db:5432/hope_dedupe_engine`
##### ADMIN_EMAIL
The email address for the admin user that can access the Django admin panel with full privileges. *Example:* `[email protected]`

##### SECRET_KEY
A secret key for the Django installation. *Example:* `django-insecure-pretty-strong`
##### ALLOWED_HOSTS
A list of host/domain names that the application can serve. *Example:* `['*']`

##### ADMIN_PASSWORD
The password for the admin user. *Example:* `123`

##### CACHE_URL
The URL for the cache server. *Example:* `redis://redis:6379/1`

##### CELERY_BROKER_URL
The URL for the Celery broker. *Example:* `redis://redis:6379/9`

##### CSRF_COOKIE_SECURE
A boolean value that determines whether the CSRF cookie should be secure. *Example:* `False`

##### DATABASE_URL
The URL for the database connection. *Example:* `postgres://hde:password@db:5432/hope_dedupe_engine`
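
The URL-style values above pack several connection settings into one string. As a rough illustration of what each part means, the sketch below splits the example values with the Python standard library. `describe_url` is a hypothetical helper; how the project itself parses these variables is not shown in this commit.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a connection URL into its individual components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # For a database URL the path is the database name;
        # for a Redis URL it is the numeric database index.
        "name": parts.path.lstrip("/"),
    }


print(describe_url("postgres://hde:password@db:5432/hope_dedupe_engine"))
print(describe_url("redis://redis:6379/1"))
```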

##### SECRET_KEY
A secret key for the Django installation. *Example:* `django-insecure-pretty-strong`

#### Root directories

##### DEFAULT_ROOT
The root directory for locally stored files. *Example:* `/var/hope_dedupe_engine/default`
#### DEEPFACE_HOME
The root directory for storing model weights for the DeepFace library. *Example:* `/var/hope_dedupe_engine/deepface`

##### MEDIA_ROOT
The root directory for media files. *Example:* `/var/hope_dedupe_engine/media`

##### STATIC_ROOT
The root directory for static files. *Example:* `/var/hope_dedupe_engine/static`

#### Storages

##### FILE_STORAGE_DEFAULT
This backend is used for storing locally downloaded DNN model files and encoded data.
```
FILE_STORAGE_DEFAULT=django.core.files.storage.FileSystemStorage
```
##### FILE_STORAGE_DNN
This backend is dedicated to storing DNN model files. Ensure that the following two files are present in this storage:

1. *deploy.prototxt.txt*: Defines the model architecture.
2. *res10_300x300_ssd_iter_140000.caffemodel*: Contains the pre-trained model weights.

The current process involves downloading files from a [GitHub repository](https://github.com/sr6033/face-detection-with-OpenCV-and-DNN) and saving them to this specific Azure Blob Storage using the command `django-admin upgrade --with-dnn-setup` or the specialized `django-admin dnnsetup` command.
In the future, an automated pipeline related to model training could handle file updates.

The storage configuration is as follows:
```
FILE_STORAGE_DNN="storages.backends.azure_storage.AzureStorage?account_name=<account_name>&account_key=<account_key>&overwrite_files=true&azure_container=dnn"
```
#### Storage backends

##### FILE_STORAGE_HOPE
This backend is used for storing HOPE dataset images. It should be configured as read-only for the service.
63 changes: 0 additions & 63 deletions src/hope_dedup_engine/apps/api/deduplication/config.py

This file was deleted.

3 changes: 1 addition & 2 deletions src/hope_dedup_engine/apps/api/deduplication/process.py
@@ -4,12 +4,11 @@

from celery import chord, shared_task

from hope_dedup_engine.apps.api.deduplication.config import DeduplicationSetConfig

# from hope_dedup_engine.apps.api.deduplication.registry import ( # DuplicateFinder,; DuplicateKeyPair,
# get_finders,
# )
from hope_dedup_engine.apps.api.models import DedupJob, DeduplicationSet, Finding
from hope_dedup_engine.apps.api.models.config import DeduplicationSetConfig
from hope_dedup_engine.apps.api.utils.notification import send_notification

# from hope_dedup_engine.apps.api.utils.progress import track_progress_multi
7 changes: 0 additions & 7 deletions src/hope_dedup_engine/apps/api/forms.py

This file was deleted.

72 changes: 71 additions & 1 deletion src/hope_dedup_engine/apps/api/models/config.py
@@ -1,11 +1,81 @@
from dataclasses import dataclass, field
from typing import Any, Self
from uuid import UUID

from django.db import models

from constance import config as constance_cfg

from hope_dedup_engine.apps.api.models.deduplication import DeduplicationSet
from hope_dedup_engine.apps.api.validators import validate_constance_config


class Config(models.Model):
    name = models.CharField(
        max_length=128, unique=True, null=True, blank=True, db_index=True
    )
    settings = models.JSONField(default=dict, null=True, blank=True)
    settings = models.JSONField(
        default=dict, null=True, blank=True, validators=[validate_constance_config]
    )

    def __str__(self) -> str:
        return self.name or f"ID: {self.pk}"

    def clean(self):

        super().clean()


@dataclass
class ModelOptions:
    model_name: str = field(default_factory=lambda: constance_cfg.MODEL_NAME)
    detector_backend: str = field(
        default_factory=lambda: constance_cfg.DETECTOR_BACKEND
    )

    def update(self, overrides: dict[str, Any]) -> None:
        for k, v in overrides.items():
            if hasattr(self, k):
                setattr(self, k, v)


@dataclass
class RepresentOptions(ModelOptions):
    pass


@dataclass
class VerifyOptions(ModelOptions):
    threshold: float = field(
        default_factory=lambda: constance_cfg.FACE_DISTANCE_THRESHOLD
    )
    silent: bool = True


@dataclass
class DeduplicationSetConfig:
    deduplication_set_id: UUID | None = None
    encoding: RepresentOptions = field(default_factory=RepresentOptions)
    deduplicate: VerifyOptions = field(default_factory=VerifyOptions)

    def update(self, overrides: dict[str, Any]) -> None:
        if not isinstance(overrides, dict):
            raise ValueError("Overrides values must be a dictionary.")
        for k, v in overrides.items():
            match k:
                case "model_name" | "detector_backend":
                    self.encoding.update({k: v})
                    self.deduplicate.update({k: v})
                case "face_distance_threshold":
                    self.deduplicate.update({"threshold": v})
                case _ if hasattr(self, k):
                    setattr(self, k, v)
                case _:
                    raise KeyError(f"Unknown config key: {k}")

    @classmethod
    def from_deduplication_set(cls, deduplication_set: DeduplicationSet) -> Self:
        instance = cls(deduplication_set_id=deduplication_set.pk)
        if deduplication_set.config:
            instance.update(deduplication_set.config.settings)
        return instance
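
To see how the key routing in `DeduplicationSetConfig.update` behaves without Django or Constance installed, here is a simplified standalone sketch. It is not the committed code: `DEFAULTS` and its values, `EncodingSketch`, `VerifySketch`, and `ConfigSketch` are hypothetical stand-ins, the `hasattr` fallback branch is omitted, and `if`/`elif` replaces the `match` statement.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical stand-ins for the Constance values; the real defaults
# are not shown in this commit.
DEFAULTS = {
    "MODEL_NAME": "example-model",
    "DETECTOR_BACKEND": "example-detector",
    "FACE_DISTANCE_THRESHOLD": 0.4,
}


@dataclass
class EncodingSketch:
    model_name: str = field(default_factory=lambda: DEFAULTS["MODEL_NAME"])
    detector_backend: str = field(
        default_factory=lambda: DEFAULTS["DETECTOR_BACKEND"]
    )


@dataclass
class VerifySketch(EncodingSketch):
    threshold: float = field(
        default_factory=lambda: DEFAULTS["FACE_DISTANCE_THRESHOLD"]
    )


@dataclass
class ConfigSketch:
    encoding: EncodingSketch = field(default_factory=EncodingSketch)
    deduplicate: VerifySketch = field(default_factory=VerifySketch)

    def update(self, overrides: Dict[str, Any]) -> None:
        for key, value in overrides.items():
            if key in ("model_name", "detector_backend"):
                # Shared options are written to both groups so they stay in sync.
                setattr(self.encoding, key, value)
                setattr(self.deduplicate, key, value)
            elif key == "face_distance_threshold":
                # The external key maps onto the `threshold` field.
                self.deduplicate.threshold = value
            else:
                raise KeyError(f"Unknown config key: {key}")


cfg = ConfigSketch()
cfg.update({"model_name": "other-model", "face_distance_threshold": 0.7})
print(cfg)
```

Because the defaults are read through `default_factory`, each new instance picks up the values current at construction time, which is presumably why the committed code wraps the Constance lookups in lambdas.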