AI Local Execution

The AI Local Execution is a service that enables execution of federated learning tasks on Infrastructure Elements. It is part of a set of AI services (with the AI Task Controller comprising the FL Training Collector, FL Repository, FL Controller, FL API and FL GUI) that should be deployed in order to run an FL task. Ideally, all services should be deployed, but it is also possible to run a task directly on AI Local Execution by configuring it through its dedicated REST API. The AI Local Execution encapsulates the functionalities of a federated learning (FL) client by maintaining a connection with the FL Controller (FL Orchestrator; GUI and monitoring), connecting to the training initiated by the FL server (FL Training Collector), periodically providing it with local weights and obtaining new global weights, as well as downloading any necessary components from the FL Repository (database).

A prerequisite is that the FL Repository is running and properly initialized with any models or transformations that AI Local Execution might need.

Beyond the classic functionality of an FL client, AI Local Execution enables the local inference deployment of a selected model (that can function as a standalone container). It uses flexible configurations, basic format verification and pluggable components.

Getting started / Use

AI Local Execution should be deployed along with the AI Task Controller components to provide the full functionality of FL training and local inference. The respective services expose REST APIs to allow communication between themselves and with external parties.

AI Local Execution can be run using docker-compose or can be deployed on a Kubernetes cluster with a dedicated Helm chart.

How to build, install, or deploy it

Running using Docker (locally)

The following command can be used in the terminal to build and run a new Docker image:

USER_INDEX=1 FL_LOCAL_OP_DATA_FOLDER="./data" docker compose up --force-recreate --build -d

Alternatively, a script can be used to do this automatically:

sh start-local-docker.sh [NUMBER OF LE]

Setting [NUMBER OF LE] to 1 will run one AI Local Execution instance.

When the container is built and running, it should be checked with the command docker ps. For one instance of AI Local Execution the output should look like this:

CONTAINER ID IMAGE ... STATUS PORTS NAMES
8c4744c648c0 aeros/ai_local_executor:latest ... Up 5 minutes 0.0.0.0:9050->9050/tcp, 0.0.0.0:30080->80/tcp, 0.0.0.0:9003->9000/tcp appv0-local_executor-1
24964eafadc9 aeros/ai_local_executor_inference:latest ... Up 5 minutes 0.0.0.0:9001->9000/tcp, 0.0.0.0:50052->50051/tcp appv0-inferenceapp-1
57de2b0a09e4 appv0-db ... Up 5 minutes 27017/tcp db-0

The Swagger documentation of the REST API should be available at http://127.0.0.1:9050/docs (if the default port configuration is preserved).
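
A quick way to verify that the service is up is to request the documentation page from the shell (a sketch, assuming the default port mapping shown above):

curl -sS -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9050/docs
# prints 200 if the trainingapp REST API is up and serving the Swagger UI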

Note 1: When running with Docker, make sure that all other containers (FL Local Execution, FL Repository, FL Training Collector) that are to be used to run the FL task are in the same network.

The following commands can be used:

docker inspect -f '{{range $key, $value := .NetworkSettings.Networks}}{{$key}} {{end}}' [CONTAINER_ID] - check network of a given container

docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' [NETWORK e.g. appv0_default] - check all containers in a given network

docker network connect [NETWORK] [CONTAINER_ID] - add a given container to a given network

Note 2: Before starting the training, at least the FL Repository should be running and properly initialized with any required transformations and model metadata. For more details see the FL Repository documentation (TBD).

Deployment on Kubernetes

The AI Local Execution service has been developed with the assumption that it will be deployed on a Kubernetes cluster with a dedicated Helm chart. To do so, run helm install <deployment name> ailocalexecution. If multiple AI Local Execution instances are to be deployed in one Kubernetes cluster, a different name should be chosen for each deployment. If only the inference component should be deployed, run helm install fllocalexecutionlocal fllocalexecution --set inferenceapp.fullDeployment.enabled=false.

To make sure that the service has been configured properly, check the 3 ConfigMaps that are deployed alongside the service; they can be inspected with the kubectl commands shown after the list below. Their names change depending on the name of the deployment.

  1. The ConfigMap whose name starts with ailocalexec-config-map contains the environment variables necessary to deploy the AI Local Execution instance. Check especially the fields REPOSITORY_ADDRESS (the address of the FL Repository instance) and CONTROLLER_SVR_ADDRESS (the address of the FL Controller).

  2. The ConfigMap whose name starts with fltraining-config-map describes the configuration necessary to run the training app component with pluggable transformations.

  3. The ConfigMap whose name starts with flinference-config-map serves to flexibly set and change the configuration for the inference component.
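
The ConfigMaps can be listed and inspected with standard kubectl commands, for example (a sketch; the exact names depend on the chosen deployment name):

kubectl get configmaps | grep config-map                        # list the three ConfigMaps described above
kubectl describe cm <name starting with ailocalexec-config-map> # check REPOSITORY_ADDRESS and CONTROLLER_SVR_ADDRESS
kubectl edit cm <name starting with ailocalexec-config-map>     # adjust the values if needed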

Testing

In order to initiate the training, a JSON with the following structure should be sent to the endpoint shown below. The most important keys and their meaning are explained further down.

POST /job/config/{training_id}/

{
  "client_type_id": "string",
  "server_address": "string",
  "eval_metrics": [
    "string"
  ],
  "eval_func": "string",
  "num_classes": 0,
  "num_rounds": 0,
  "shape": [
    0
  ],
  "training_id": 0,
  "model_name": "string",
  "model_version": "string",
  "config": [
    {
      "config_id": "string",
      "batch_size": 0,
      "steps_per_epoch": 0,
      "epochs": 0,
      "learning_rate": 0
    }
  ],
  "optimizer_config": {
    "optimizer": "string",
    "lr": 0,
    "rho": 0,
    "eps": 0,
    "foreach": true,
    "maximize": true,
    "lr_decay": 0,
    "betas": [
      "string",
      "string"
    ],
    "etas": [
      "string",
      "string"
    ],
    "step_sizes": [
      "string",
      "string"
    ],
    "lambd": 0,
    "alpha": 0,
    "t0": 0,
    "max_iter": 0,
    "max_eval": 0,
    "tolerance_grad": 0,
    "tolerance_change": 0,
    "history_size": 0,
    "line_search_fn": "string",
    "momentum_decay": 0,
    "dampening": 0,
    "centered": true,
    "nesterov": true,
    "momentum": 0,
    "weight_decay": 0,
    "amsgrad": true,
    "learning_rate": 0,
    "name": "string",
    "clipnorm": 0,
    "global_clipnorm": 0,
    "use_ema": true,
    "ema_momentum": 0,
    "ema_overwrite_frequency": 0,
    "jit_compile": true,
    "epsilon": 0,
    "clipvalue": 0,
    "initial_accumulator_value": 0,
    "beta_1": 0,
    "beta_2": 0,
    "beta_2_decay": 0,
    "epsilon_1": 0,
    "epsilon_2": 0,
    "learning_rate_power": 0,
    "l1_regularization_strength": 0,
    "l2_regularization_strength": 0,
    "l2_shrinkage_regularization_strength": 0,
    "beta": 0
  },
  "scheduler_config": {
    "scheduler": "string",
    "step_size": 0,
    "gamma": 0,
    "last_epoch": 0,
    "verbose": true,
    "milestones": [
      0
    ],
    "factor": 0,
    "total_iters": 0,
    "start_factor": 0,
    "end_factor": 0,
    "monitor": "string",
    "min_delta": 0,
    "patience": 0,
    "mode": "string",
    "baseline": 0,
    "restore_best_weights": true,
    "start_from_epoch": 0,
    "cooldown": 0,
    "min_lr": 0
  },
  "warmup_config": {
    "scheduler": "string",
    "warmup_iters": 0,
    "warmup_epochs": 0,
    "warmup_factor": 0,
    "scheduler_conf": {
      "scheduler": "string",
      "step_size": 0,
      "gamma": 0,
      "last_epoch": 0,
      "verbose": true,
      "milestones": [
        0
      ],
      "factor": 0,
      "total_iters": 0,
      "start_factor": 0,
      "end_factor": 0,
      "monitor": "string",
      "min_delta": 0,
      "patience": 0,
      "mode": "string",
      "baseline": 0,
      "restore_best_weights": true,
      "start_from_epoch": 0,
      "cooldown": 0,
      "min_lr": 0
    }
  },
  "privacy-mechanisms": {}
}

The definitions:

  • client_type_id Specifies the ID of the client. Allows bypassing the pluggability modules for the PyTorch builder with the keyword "base" for testing purposes.

  • server_address The address of the Flower server that the FL client should try to connect to.

  • eval_metrics The evaluation metrics which will be gathered through the evaluation process by the FL client.

  • eval_func The evaluation function that the model will use as the loss throughout the training process.

  • num_classes The number of classes in classification problems.

  • num_rounds The number of rounds that the training should run for.

  • shape The shape of the data. Currently, it is recommended to change this parameter through the ConfigMaps instead.

  • training_id The id of the training process being conducted.

  • model_name The name of the model that will be used in the training. The name should be the same as the one stored in FL Repository.

  • model_version The version of the model that will be used in the training. The version should be the same as the one stored in the FL Repository.

  • config The configuration specifying how the FL training process will be conducted on the client, containing important settings such as batch_size or learning_rate.

  • optimizer_config The configuration of the optimizer.

  • optimizer

    For the Keras model and client, the optimizer can be one of:

     "sgd": tf.keras.optimizers.SGD,
     "rmsprop": tf.keras.optimizers.RMSprop,
     "adam": tf.keras.optimizers.Adam,
     "adadelta": tf.keras.optimizers.Adadelta,
     "adagrad": tf.keras.optimizers.Adagrad,
     "adamax": tf.keras.optimizers.Adamax,
     "nadam": tf.keras.optimizers.Nadam,
     "ftrl": tf.keras.optimizers.Ftrl
    

    For the PyTorch model and client, the optimizer can be one of:

     "adadelta": torch.optim.Adadelta,
     "adagrad": torch.optim.Adagrad,
     "adam": torch.optim.Adam,
     "adamw": torch.optim.AdamW,
     "sparseadam": torch.optim.SparseAdam,
     "adamax": torch.optim.Adamax,
     "asgd": torch.optim.ASGD,
     "lbfgs": torch.optim.LBFGS,
     "nadam": torch.optim.NAdam,
     "radam": torch.optim.RAdam,
     "rmsprop": torch.optim.RMSprop,
     "rprop": torch.optim.Rprop,
     "sgd": torch.optim.SGD
    

    Other fields indicate the arguments that should be passed to the optimizer.

  • scheduler_config The configuration of the scheduler.

  • scheduler

    For the Keras model and client, the scheduler (or here, a more appropriate name would be a Keras callback) can be one of:

     "earlystopping": tf.keras.callbacks.EarlyStopping,
     "reducelronplateau": tf.keras.callbacks.ReduceLROnPlateau,
     "terminateonnan": tf.keras.callbacks.TerminateOnNaN
    

    For the Pytorch model and client, the scheduler can be one of:

     "lambdalr": torch.optim.lr_scheduler.LambdaLR,
     "multiplicativelr": torch.optim.lr_scheduler.MultiplicativeLR,
     "steplr": torch.optim.lr_scheduler.StepLR,
     "multisteplr": torch.optim.lr_scheduler.MultiStepLR,
     "constantlr": torch.optim.lr_scheduler.ConstantLR,
     "linearlr": torch.optim.lr_scheduler.LinearLR,
     "exponentiallr": torch.optim.lr_scheduler.ExponentialLR,
     "cosineannealinglr": torch.optim.lr_scheduler.CosineAnnealingLR,
     "chainedscheduler": torch.optim.lr_scheduler.ChainedScheduler,
     "sequentiallr": torch.optim.lr_scheduler.SequentialLR,
     "reducelronplateau": torch.optim.lr_scheduler.ReduceLROnPlateau,
     "cycliclr": torch.optim.lr_scheduler.CyclicLR,
     "onecyclelr": torch.optim.lr_scheduler.OneCycleLR,
     "cosineannealingwarmrestarts": torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
    

    Other fields indicate the arguments that should be passed to the scheduler.

  • warmup_config The configuration of an (optional) warmup. This configuration is valid only for the PyTorch builder. It specifies a special scheduler, which can be used only for a selected number of epochs to provide warmup throughout the process.

  • scheduler The name of the scheduler.

    Other fields indicate the arguments that should be passed to the scheduler.

A sample test configuration can be seen here:

{
  "client_type_id": "local1",
  "server_address": "fl-training-collector-trainingmain-1",
  "eval_metrics": [
    "accuracy"
  ],
  "eval_func": "categorical_crossentropy",
  "num_classes": 10,
  "num_rounds": 15,
  "shape": [
    32,
    32,
    3
  ],
  "training_id": "1",
  "model_name": "md_keras",
  "model_version": "v1",
  "config": [
    {
      "config_id": "min_effort",
      "batch_size": "64",
      "steps_per_epoch": "32",
      "epochs": "1",
      "learning_rate": "0.001"
    }
  ],
  "optimizer_config": {
    "optimizer": "adam",
    "learning_rate": "0.005",
    "amsgrad": "True"
  },
  "scheduler_config": {
    "scheduler": "reducelronplateau",
    "factor": "0.5",
    "min_delta": "0.0003"
  },
  "privacy-mechanisms": {}
}
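
If the sample configuration above is saved to a file named config.json (a hypothetical file name), it can be submitted to a locally running instance, for example, like this (assuming the default port mapping shown earlier and the training_id used in the sample):

curl -X POST "http://127.0.0.1:9050/job/config/1/" \
  -H "Content-Type: application/json" \
  -d @config.json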

Other API endpoints

  • POST /model/ Receive new training model metadata for local storage
  • PUT /model/{name}/{version} Update the weights and structure of the locally stored training model.
  • GET /job/status Get the statuses of the current jobs.
  • GET /job/total Get the number of currently running jobs.
  • GET /capabilities Get the computational capabilities of the machine that AI Local Execution is running on.
  • GET /format Get the format of the data that a given AI Local Execution instance currently has access to.

Websockets

A websocket client is running in the background of the trainingapp pod. Its purpose is to provide a continuous means of communication with the FL Controller, so that the Orchestrator knows exactly which AI Local Execution instances are active and can participate in training. It will try to connect to the FL Controller server via the ORCHESTRATOR_WS_ADDRESS address configured in the fllocalops-config-map ConfigMap. To change it, it is enough to modify this address with kubectl edit cm and recreate the trainingapp pod.
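
A possible sequence (a sketch; the exact ConfigMap and pod names depend on the deployment name and can be found with kubectl get cm and kubectl get pods):

kubectl edit cm <name starting with fllocalops-config-map>  # update ORCHESTRATOR_WS_ADDRESS
kubectl delete pod <trainingapp pod name>                   # delete the pod so it is recreated with the new address (assuming it is managed by a Deployment)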

The inference component

The inference component corresponds to the inferenceapp pod and can also function as a standalone service. It uses gRPC for lightweight communication. It allows the configuration to be set up through the modification of configuration files located in the configurations directory (which can also be modified on the fly by changing the values in the flinference-config-map and restarting the pod), as well as the addition and removal of serialized objects (they can be accessed and changed as a Kubernetes volume or downloaded on the fly from the FL Repository in the case of data transformations and models). By default, the inference component accepts data in the form of numerical arrays of any shape and uses a TFLite model to provide lightweight and fast inference. However, it is possible to change the input shape and further details with the use of pluggability.

The inference component is, by default, installed with the rest of the Helm chart. It can then be accessed through the service fllocaloperationslocal-inferenceapp on port 50051, according to the specification located in inference_application/code/proto/basic-inference.proto.
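
For a quick manual test from outside the cluster, the service can be port-forwarded and then called with any gRPC client generated from the proto file mentioned above (a sketch; the service name below is the default one and changes with the chosen deployment name):

kubectl port-forward svc/fllocaloperationslocal-inferenceapp 50051:50051
# a gRPC client generated from inference_application/code/proto/basic-inference.proto
# can then reach the inference service at localhost:50051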

Privacy

There are two privacy mechanisms implemented for use by the FL system. The FL Training Collector can be configured to work with either of them, both, or neither through the training configuration.

Differential Privacy

The mechanism of Adaptive Differential Privacy modifies the selected strategy by introducing noise to the local model parameters before they are sent by the client. This increases the privacy of the data on the client by obfuscating the information about its distribution. This specific implementation additionally uses adaptive clipping to balance the influence of multiple clients. The use of this privacy technique may lead to a degradation in the performance of the final model, but introduces little to no additional computational cost.

The use of adaptive differential privacy and its specific parameters can be specified in the training configuration under the privacy-mechanisms keyword. If we include dp-adaptive in this dictionary, we can specify the parameters used by the Flower implementation under the dp-adaptive key and configure the training like this:

"privacy-mechanisms": {
  "dp-adaptive": {
  "num_sampled_clients": "1"
  }
}

Homomorphic Encryption

The mechanism of Federated Averaging with Homomorphic Encryption has been implemented from scratch using the TenSEAL library. Homomorphic Encryption allows numbers to be encrypted in such a way that the decrypted sum of encrypted numbers is the same as the sum of the original numbers (and similarly for subtraction and multiplication). It therefore allows the FL clients to send their encrypted weights, which can then be aggregated and returned as averaged weights in encrypted form. This ensures that in the event of a malicious server (or a malicious eavesdropper) the privacy of the clients' data remains intact.

The current implementation encrypts the parameters as a CKKS tensor (as implemented in TenSEAL), so if the user would like to generate and serialize new keys and contexts, they should be compatible with this method.

In order to generate a new set of keys, you can use the file application/generate_homomorphic_keys.py. If a new set of keys is generated, the application/src/custom_clients/hm_keys/public.text and application/src/custom_clients/hm_keys/secret.text files should be replaced accordingly (and potentially turned into a Kubernetes secret).
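
A minimal sketch of the regeneration step, run from the repository root:

python application/generate_homomorphic_keys.py
# afterwards, make sure application/src/custom_clients/hm_keys/public.text and
# application/src/custom_clients/hm_keys/secret.text contain the newly generated
# context/keys (or provide them as a Kubernetes secret, as suggested above)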

Attention: As an extremely computationally expensive method, it can usually be used only for the simplest models and datasets. It is therefore not recommended to use this implementation for models more complicated than a simple linear regression.

Pluggable modules

The trainingapp component supports FL training with the use of the Keras and PyTorch libraries out of the box. Similarly, the inferenceapp component supports inference with the TFLite inferencer. However, it is possible to develop custom components for:

  • in the case of trainingapp:
    • FL client
    • FL model
    • FL data loader
    • FL data transformations
  • in the case of inferenceapp:
    • gRPC service along with the proto and protocompiled files
    • inferencer
    • model.

In order to deploy the image with your custom components through the use of a Kubernetes volume, change the custom_setup field in values.yaml to True.
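
Alternatively, the flag can be overridden at install time instead of editing values.yaml (a sketch, reusing the chart and release naming from the installation section; verify the exact key path against the chart's values.yaml):

helm install <deployment name> ailocalexecution --set custom_setup=true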

Prometheus metric monitoring

The Prometheus metrics are available for scraping on port 9050 under the /metrics path on the trainingapp, and on port 9000 (without any additional URL path) on the inferenceapp.
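
A quick manual check (a sketch, assuming the container ports are reachable locally, e.g. via kubectl port-forward; with the Docker compose setup shown earlier the host ports may differ):

curl http://localhost:9050/metrics   # trainingapp metrics
curl http://localhost:9000/          # inferenceapp metrics (served on port 9000 without an additional path)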

Authors

The AI Local Execution service is a continuation of research conducted within the Horizon 2020 ASSIST-IoT project.

Licensing

The AI Local Execution service is released under the Apache 2.0 license, as we have internally concluded that we are not "offering the functionality of MongoDB, or modified versions of MongoDB, to third parties as a service". However, potential future commercial adopters should be aware that our project uses MongoDB, so that they can accurately determine the license most applicable to their projects.
