chg ! deduplicationset config

unicef · Jan 21, 2025 · commit 7b3eff1 · 1 parent 2cba530
Showing 14 changed files with 291 additions and 298 deletions.
diff --git a/docs/src/did/config.md b/docs/src/did/config.md
@@ -1,68 +1,49 @@
-The configuration can be managed directly through the **admin panel**, which provides a simple way to modify settings without changing the codebase. Navigate to:
+### Default Configuration
 
-    Home › Constance › Config
+The default configuration can be managed directly through the **admin panel**, providing a simple way to modify settings without changing the codebase. Navigate to:
 
-Here, you will find all the configurable settings that affect the behavior of the system, allowing for quick adjustments and better control over application behavior.
+    Home › Constance › Config  
 
-## Deep neural networks (DNN)
+This section contains the **default configuration**, which applies to the system globally. It defines the base settings for the application and ensures consistent behavior across all deduplication sets.  
 
-The deep learning component of the system is crucial for performing advanced inference tasks, including **face detection**, **face recognition**, and **finding duplicate images** using a pre-trained model. These tasks are fundamental to ensuring the accuracy and efficiency of the system in identifying and managing images.
+#### Key Parameters:  
 
-This component relies on **Convolutional Neural Networks (CNNs)**, a type of deep learning model particularly well-suited for processing visual data. CNNs are used to automatically extract relevant features from images, such as facial landmarks and distinctive patterns, without the need for manual feature engineering.
+- **MODEL_NAME**  
+  Specifies the face recognition model to be used for **encoding** face landmarks.  
 
-### DNN_BACKEND
+- **DETECTOR_BACKEND**  
+  Specifies the face detector backend to be used for **detecting** faces in images.  
 
-Specifies the computation backend to be used by [OpenCV](https://github.com/opencv/opencv) library for deep learning inference.
+- **FACE_DISTANCE_THRESHOLD**  
+  Specifies the maximum allowable **distance** between two face embeddings for them to be considered a match. This **tolerance threshold** is crucial for assessing whether two faces belong to the same individual. Lower values result in stricter matching, while higher values allow for more lenient matches.  
 
-### DNN_TARGET
+---
 
-Specifies the target device on which [OpenCV](https://github.com/opencv/opencv) library will perform the deep learning computations.
+### Custom Configuration
 
+Custom configurations can be created for specific **deduplication sets** through the **admin panel** by navigating to:  
 
-## Face Detection
+    Home › API › Configs  
 
-This component is responsible for locating and identifying faces in images. It uses advanced deep learning algorithms to scan images and detect the regions that contain human faces. This section outlines the key configuration parameters that influence how the face detection model processes input images and optimizes detection results.
+At this path, you can define a custom configuration in **JSON format**. Custom configurations include only the parameters you wish to override from the default configuration.  
 
-### BLOB_FROM_IMAGE_SCALE_FACTOR
+#### Details:
+- Parameters omitted from the custom configuration fall back to the values in the default configuration, so every required parameter is always defined when the system runs its calculations.  
+- A custom configuration can be associated with a specific deduplication set via:  
 
-Specifies the scaling factor applied to all pixel values when converting an image to a blob. Mostly it equals 1.0 for no scaling or 1.0/255.0 and normalizing to the [0, 1] range.
+        Home › API › Deduplication sets › specific_deduplication_set  
 
-Remember that scaling factor is also applied to mean values. Both scaling factor and mean values must be the same for the training and inference to get the correct results.
+- If no custom configuration is associated with a deduplication set, the system will entirely use the **default configuration** for that set.  
 
-### BLOB_FROM_IMAGE_MEAN_VALUES
+- Parameter names in custom configurations are case-sensitive and must be in lowercase.
 
-Specifies the mean BGR values used in image preprocessing to normalize pixel values by subtracting the mean values of the training dataset. This helps in reducing model bias and improving accuracy.
+#### Example Custom Configuration:  
+To override only the **FACE_DISTANCE_THRESHOLD** parameter, your custom configuration can look like this:  
 
-The specified mean values are subtracted from each channel (Blue, Green, Red) of the input image.
-
-Remember that mean values are also applied to scaling factor. Both scaling factor and mean values must be the same for the training and inference to get the correct results.
-
-### FACE_DETECTION_CONFIDENCE
-
-Specifies the minimum confidence score required for a detected face to be considered valid. Detections with confidence scores below this threshold are discarded as likely false positives.
-
-### NMS_THRESHOLD
-
-Specifies the Intersection over Union (IoU) threshold used in Non-Maximum Suppression (NMS) to filter out overlapping bounding boxes. If the IoU between two boxes exceeds this threshold, the box with the lower confidence score is suppressed. Lower values result in fewer, more distinct boxes; higher values allow more overlapping boxes to remain.
-
-## Face Recognition
-
-This component builds on face detection to identify and differentiate between individual faces. This involves generating face encodings, which are numerical representations of the unique facial features used for recognition. These encodings can then be compared to determine if two images contain the same person or to find matches in a database of known faces.
-
-### FACE_ENCODINGS_NUM_JITTERS
-
-Specifies the number of times to re-sample the face when calculating the encoding. Higher values increase accuracy but are computationally more expensive and slower. For example, setting 'num_jitters' to 100 makes the process 100 times slower.
-
-### FACE_ENCODINGS_MODEL
-
-Specifies the model type used for encoding face landmarks. It can be either 'small' which is faster and  only 5 key facial landmarks, or 'large' which is more precise and identifies 68 key facial landmarks but requires more computational resources.
-
-
-## Duplicate Finder
-
-This component is responsible for identifying duplicate images in the system by comparing face embeddings. These embeddings are numerical representations of facial features generated during the face recognition process. By calculating the distance between the embeddings of different images, the system can determine whether two images contain the same person, helping in the identification and removal of duplicates or grouping similar faces together.
-
-### FACE_DISTANCE_THRESHOLD
-
-Specifies the maximum allowable distance between two face embeddings for them to be considered a match. It helps determine if two faces belong to the same person by setting a threshold for similarity. Lower values result in stricter matching, while higher values allow for more lenient matches.
+```json
+{
+  "face_distance_threshold": 0.7
+}
+```
 
+A custom configuration changes behavior only for the deduplication sets it is associated with; the default configuration remains unchanged for all other sets.
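To illustrate the `FACE_DISTANCE_THRESHOLD` semantics described in `docs/src/did/config.md`, here is a minimal sketch in Python. It assumes cosine distance between embeddings, which this diff does not specify, and `cosine_distance` and `is_match` are hypothetical helpers rather than functions from the project.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_match(a: list[float], b: list[float], threshold: float) -> bool:
    """Two faces match when their distance does not exceed the threshold."""
    return cosine_distance(a, b) <= threshold


# Toy 3-dimensional "embeddings" (assumption: real embeddings are much larger).
face_a = [1.0, 0.0, 0.0]
face_b = [0.8, 0.6, 0.0]  # cosine similarity 0.8, so the distance is about 0.2

lenient = is_match(face_a, face_b, threshold=0.4)  # higher threshold: accepted
strict = is_match(face_a, face_b, threshold=0.1)   # lower threshold: rejected
```

The same pair of faces is accepted under the higher threshold and rejected under the lower one, which is the "stricter versus more lenient" trade-off the documentation describes.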
diff --git a/docs/src/setup.md b/docs/src/setup.md
@@ -5,7 +5,7 @@ tags:
 
 ## Prerequisites
 
-This project utilizes [PDM](https://pdm-project.org/) as the package manager for managing Python dependencies and environments. 
+This project utilizes [UV](https://docs.astral.sh/uv/) as the package manager for managing Python dependencies and environments. 
 
 To successfully set up and run this project, ensure that you have the following components in place:
 
@@ -23,76 +23,72 @@ Essential steps for verifying and configuring the environment settings required
 
 ### Display the Current Configuration
 
-    $ docker run -it -t  unicef/hope-dedupe-engine:release-0.1 django-admin env
+    $ docker run -it -t  unicef/hope-dedupe-engine:release-0.2 django-admin env
 
 ### Mandatory Environment Variables
 Check Environment Variables
 
-    $ docker run -it -t  unicef/hope-dedupe-engine:release-0.1 django-admin env --check
+    $ docker run -it -t  unicef/hope-dedupe-engine:release-0.2 django-admin env --check
 
 Ensure the following environment variables are properly configured:
-
-    DATABASE_URL
-    SECRET_KEY
+
+    ADMIN_EMAIL
+    ADMIN_PASSWORD
+    ALLOWED_HOSTS
     CACHE_URL
     CELERY_BROKER_URL
-    MEDIA_ROOT
-    STATIC_ROOT
-    DEFAULT_ROOT
-    FILE_STORAGE_DNN
+    CSRF_COOKIE_SECURE
+    DATABASE_URL
+    DEEPFACE_HOME
     FILE_STORAGE_HOPE
     FILE_STORAGE_STATIC
     FILE_STORAGE_MEDIA
+    MEDIA_ROOT
+    SECRET_KEY
+    STATIC_ROOT
 
 ### Variables Breakdown
 
 Detailed information about the required environment variables is provided for clarity and proper configuration.
 
 #### Operational
 
-##### DATABASE_URL
-The URL for the database connection. *Example:* `postgres://hde:password@db:5432/hope_dedupe_engine`
+##### ADMIN_EMAIL
+The email address for the admin user that can access the Django admin panel with full privileges. *Example:* `[email protected]`
 
-##### SECRET_KEY
-A secret key for the Django installation. *Example:* `django-insecure-pretty-strong`
+##### ALLOWED_HOSTS
+A list of host/domain names that the application can serve. *Example:* `['*']`
+
+##### ADMIN_PASSWORD
+The password for the admin user. *Example:* `123`
 
 ##### CACHE_URL
 The URL for the cache server. *Example:* `redis://redis:6379/1`
 
 ##### CELERY_BROKER_URL
 The URL for the Celery broker. *Example:* `redis://redis:6379/9`
 
+##### CSRF_COOKIE_SECURE
+A boolean value that determines whether the CSRF cookie should be secure. *Example:* `False`
+
+##### DATABASE_URL
+The URL for the database connection. *Example:* `postgres://hde:password@db:5432/hope_dedupe_engine`
+
+##### SECRET_KEY
+A secret key for the Django installation. *Example:* `django-insecure-pretty-strong`
+
 #### Root directories
 
-##### DEFAULT_ROOT
-The root directory for locally stored files. *Example:* `/var/hope_dedupe_engine/default`
+##### DEEPFACE_HOME
+The root directory for storing model weights for the DeepFace library. *Example:* `/var/hope_dedupe_engine/deepface`
 
 ##### MEDIA_ROOT
 The root directory for media files. *Example:* `/var/hope_dedupe_engine/media`
 
 ##### STATIC_ROOT
 The root directory for static files. *Example:* `/var/hope_dedupe_engine/static`
 
-#### Storages
-
-##### FILE_STORAGE_DEFAULT
-This backend is used for storing locally downloaded DNN model files and encoded data.
-    ```
-    FILE_STORAGE_DEFAULT=django.core.files.storage.FileSystemStorage
-    ```
-#####  FILE_STORAGE_DNN
-This backend is dedicated to storing DNN model files. Ensure that the following two files are present in this storage:
-
-1. *deploy.prototxt.txt*: Defines the model architecture.
-2. *res10_300x300_ssd_iter_140000.caffemodel*: Contains the pre-trained model weights.
-
-The current process involves downloading files from a [GitHub repository](https://github.com/sr6033/face-detection-with-OpenCV-and-DNN) and saving them to this specific Azure Blob Storage using command `django-admin upgrade --with-dnn-setup`, or the specialized`django-admin dnnsetup` command .
-In the future, an automated pipeline related to model training could handle file updates.
-
-The storage configuration is as follows:
-```
-FILE_STORAGE_DNN="storages.backends.azure_storage.AzureStorage?account_name=<account_name>&account_key=<account_key>&overwrite_files=true&azure_container=dnn"
-```
+#### Storage backends
 
 ##### FILE_STORAGE_HOPE
 This backend is used for storing HOPE dataset images. It should be configured as read-only for the service.

diff --git a/src/hope_dedup_engine/apps/api/deduplication/config.py b/src/hope_dedup_engine/apps/api/deduplication/config.py
diff --git a/src/hope_dedup_engine/apps/api/deduplication/process.py b/src/hope_dedup_engine/apps/api/deduplication/process.py
@@ -4,12 +4,11 @@
 
 from celery import chord, shared_task
 
-from hope_dedup_engine.apps.api.deduplication.config import DeduplicationSetConfig
-
 # from hope_dedup_engine.apps.api.deduplication.registry import (  # DuplicateFinder,; DuplicateKeyPair,
 #     get_finders,
 # )
 from hope_dedup_engine.apps.api.models import DedupJob, DeduplicationSet, Finding
+from hope_dedup_engine.apps.api.models.config import DeduplicationSetConfig
 from hope_dedup_engine.apps.api.utils.notification import send_notification
 
 # from hope_dedup_engine.apps.api.utils.progress import track_progress_multi

diff --git a/src/hope_dedup_engine/apps/api/forms.py b/src/hope_dedup_engine/apps/api/forms.py
diff --git a/src/hope_dedup_engine/apps/api/models/config.py b/src/hope_dedup_engine/apps/api/models/config.py
@@ -1,11 +1,81 @@
+from dataclasses import dataclass, field
+from typing import Any, Self
+from uuid import UUID
+
 from django.db import models
 
+from constance import config as constance_cfg
+
+from hope_dedup_engine.apps.api.models.deduplication import DeduplicationSet
+from hope_dedup_engine.apps.api.validators import validate_constance_config
+
 
 class Config(models.Model):
     name = models.CharField(
         max_length=128, unique=True, null=True, blank=True, db_index=True
     )
-    settings = models.JSONField(default=dict, null=True, blank=True)
+    settings = models.JSONField(
+        default=dict, null=True, blank=True, validators=[validate_constance_config]
+    )
 
     def __str__(self) -> str:
         return self.name or f"ID: {self.pk}"
+
+    def clean(self):
+
+        super().clean()
+
+
+@dataclass
+class ModelOptions:
+    model_name: str = field(default_factory=lambda: constance_cfg.MODEL_NAME)
+    detector_backend: str = field(
+        default_factory=lambda: constance_cfg.DETECTOR_BACKEND
+    )
+
+    def update(self, overrides: dict[str, Any]) -> None:
+        for k, v in overrides.items():
+            if hasattr(self, k):
+                setattr(self, k, v)
+
+
+@dataclass
+class RepresentOptions(ModelOptions):
+    pass
+
+
+@dataclass
+class VerifyOptions(ModelOptions):
+    threshold: float = field(
+        default_factory=lambda: constance_cfg.FACE_DISTANCE_THRESHOLD
+    )
+    silent: bool = True
+
+
+@dataclass
+class DeduplicationSetConfig:
+    deduplication_set_id: UUID | None = None
+    encoding: RepresentOptions = field(default_factory=RepresentOptions)
+    deduplicate: VerifyOptions = field(default_factory=VerifyOptions)
+
+    def update(self, overrides: dict[str, Any]) -> None:
+        if not isinstance(overrides, dict):
+            raise ValueError("Overrides must be a dictionary.")
+        for k, v in overrides.items():
+            match k:
+                case "model_name" | "detector_backend":
+                    self.encoding.update({k: v})
+                    self.deduplicate.update({k: v})
+                case "face_distance_threshold":
+                    self.deduplicate.update({"threshold": v})
+                case _ if hasattr(self, k):
+                    setattr(self, k, v)
+                case _:
+                    raise KeyError(f"Unknown config key: {k}")
+
+    @classmethod
+    def from_deduplication_set(cls, deduplication_set: DeduplicationSet) -> Self:
+        instance = cls(deduplication_set_id=deduplication_set.pk)
+        if deduplication_set.config:
+            instance.update(deduplication_set.config.settings)
+        return instance
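The override routing in `DeduplicationSetConfig.update` can be tried outside Django with a simplified sketch. It replaces the constance lookup with a hypothetical `DEFAULTS` dictionary whose values are illustrative only, renames the class to `SetConfig`, and drops the `hasattr` fallback branch, so it approximates the behavior in the diff rather than copying it.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stand-in for the constance defaults; values are illustrative only.
DEFAULTS = {
    "MODEL_NAME": "model-a",
    "DETECTOR_BACKEND": "detector-a",
    "FACE_DISTANCE_THRESHOLD": 0.4,
}


@dataclass
class ModelOptions:
    model_name: str = field(default_factory=lambda: DEFAULTS["MODEL_NAME"])
    detector_backend: str = field(
        default_factory=lambda: DEFAULTS["DETECTOR_BACKEND"]
    )


@dataclass
class VerifyOptions(ModelOptions):
    threshold: float = field(
        default_factory=lambda: DEFAULTS["FACE_DISTANCE_THRESHOLD"]
    )


@dataclass
class SetConfig:
    encoding: ModelOptions = field(default_factory=ModelOptions)
    deduplicate: VerifyOptions = field(default_factory=VerifyOptions)

    def update(self, overrides: dict[str, Any]) -> None:
        for key, value in overrides.items():
            if key in ("model_name", "detector_backend"):
                # Model settings apply to both the encoding and the verify step.
                setattr(self.encoding, key, value)
                setattr(self.deduplicate, key, value)
            elif key == "face_distance_threshold":
                # The public key name maps to the `threshold` field.
                self.deduplicate.threshold = value
            else:
                raise KeyError(f"Unknown config key: {key}")


cfg = SetConfig()
cfg.update({"face_distance_threshold": 0.7, "model_name": "model-b"})
```

Keys are matched in lowercase, so in this sketch an uppercase key such as `FACE_DISTANCE_THRESHOLD` raises `KeyError` instead of being silently ignored.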