[2.4] Add xgboost metrics tracking cb (#2381)
YuanTingHsieh authored Mar 11, 2024
1 parent 4946ac6 commit c2c3548
Showing 27 changed files with 250 additions and 114 deletions.
21 changes: 5 additions & 16 deletions examples/advanced/random_forest/README.md
@@ -97,21 +97,10 @@ By default, CPU based training is used.
If CUDA is installed on the site, tree construction and prediction can be
accelerated using GPUs.

GPUs are enabled by using :code:`gpu_hist` as the :code:`tree_method` parameter.
For example,
::
"xgboost_params": {
"max_depth": 8,
"eta": 0.1,
"objective": "binary:logistic",
"eval_metric": "auc",
"tree_method": "gpu_hist",
"gpu_id": 0,
"nthread": 16
}

For GPU-based training, edit `job_config_gen.sh` to change `TREE_METHOD="hist"` to `TREE_METHOD="gpu_hist"`.
Then run `job_config_gen.sh` again to generate new job configs for GPU-based training.
In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
Then, in `FedXGBTreeExecutor` we use the `device` parameter to map each rank to a GPU device ordinal.
If using multiple GPUs, we can map each rank to a different GPU device; with a single GPU, you can map every rank to the same device.
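
As an illustration, the mapping might look like the following minimal sketch (the variable names are illustrative, not the exact NVFlare implementation; it assumes the executor knows its assigned `rank`):

```python
# Hedged sketch: choose a GPU ordinal for this rank (names are illustrative).
num_gpus = 1          # assumption: number of GPUs visible on the site
rank = 0              # assumption: the rank assigned to this client process
params = {"tree_method": "hist"}
params["device"] = f"cuda:{rank % num_gpus}"  # all ranks share cuda:0 when num_gpus == 1
```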

## Run experiments
After you run the two scripts `data_split_gen.sh` and `jobs_gen.sh`, the experiments can be run with the NVFlare simulator.
@@ -162,4 +151,4 @@ AUC over first 1000000 instances is: 0.7828698775310959
AUC over first 1000000 instances is: 0.779952094937354
20_clients_square_split_scaled_lr_split_0.02_subsample
AUC over first 1000000 instances is: 0.7825360505137948
```
```
1 change: 0 additions & 1 deletion examples/advanced/random_forest/jobs_gen.sh
@@ -1,6 +1,5 @@
#!/usr/bin/env bash

# change to "gpu_hist" for gpu training
TREE_METHOD="hist"
DATA_SPLIT_ROOT="/tmp/nvflare/random_forest/HIGGS/data_splits"

2 changes: 1 addition & 1 deletion examples/advanced/random_forest/utils/model_validation.py
@@ -40,7 +40,7 @@ def model_validation_args_parser():
help="Total number of trees",
)
parser.add_argument(
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
)
return parser

@@ -40,7 +40,7 @@ def job_config_args_parser():
parser.add_argument("--lr_mode", type=str, default="uniform", help="Whether to use uniform or scaled shrinkage")
parser.add_argument("--nthread", type=int, default=16, help="nthread for xgboost")
parser.add_argument(
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
)
return parser

15 changes: 9 additions & 6 deletions examples/advanced/vertical_xgboost/README.md
@@ -2,7 +2,7 @@
This example shows how to use vertical federated learning with [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on tabular data.
Here we use the optimized gradient boosting library [XGBoost](https://github.com/dmlc/xgboost) and leverage its federated learning support.

Before starting please make sure you set up a [virtual environment](../../../README.md#set-up-a-virtual-environment) and install the additional requirements:
Before starting please make sure you set up a [virtual environment](../../README.md#set-up-a-virtual-environment) and install the additional requirements:
```
python3 -m pip install -r requirements.txt
```
@@ -30,7 +30,7 @@ Run the following command to prepare the data splits:
### Private Set Intersection (PSI)
Since not every site will have the same set of data samples (rows), we can use PSI to compare encrypted versions of the sites' datasets in order to jointly compute the intersection based on common IDs. In this example, the HIGGS dataset does not contain unique identifiers so we add a temporary `uid_{idx}` to each instance and give each site a portion of the HIGGS dataset that includes a common overlap. Afterwards the identifiers are dropped since they are only used for matching, and training is then done on the intersected data. To learn more about our PSI protocol implementation, see our [psi example](../psi/README.md).

> **_NOTE:_** The uid can be a composition of multiple variabes with a transformation, however in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
> **_NOTE:_** The uid can be a composition of multiple variables with a transformation; however, in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
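
To make the matching step concrete, here is a hedged sketch of what happens after PSI produces the common IDs (the file names and `uid` column are assumptions for illustration, not this example's actual artifacts):

```python
import pandas as pd

# Hedged sketch: keep only the rows whose uid is in the PSI intersection,
# then drop uid so training never sees the identifier.
site_df = pd.read_csv("site-1.csv")  # assumes a "uid" column was added earlier
with open("intersection.txt") as f:
    common_ids = {line.strip() for line in f}
train_df = site_df[site_df["uid"].isin(common_ids)].drop(columns=["uid"])
```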
Create the psi job using the predefined psi_csv template:
```
@@ -58,7 +58,9 @@ Lastly, we must subclass `XGBDataLoader` and implement the `load_data()` method.
By default, CPU based training is used.

In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`.
Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
If using multiple GPUs, we can map each rank to a different GPU device; with a single GPU, you can map every rank to the same device.

We can create a GPU enabled job using the job CLI:
```
@@ -87,10 +87,11 @@ The model will be saved to `test.model.json`.
## Results
Model accuracy can be visualized in tensorboard:
```
tensorboard --logdir /tmp/nvflare/vertical_xgb
tensorboard --logdir /tmp/nvflare/vertical_xgb/simulate_job/tb_events
```

An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS.
Used an intersection of 50000 samples across 5 clients each with different features, and ran for ~50 rounds due to early stopping.
An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS:
(Used an intersection of 50000 samples across 5 clients each with different features,
and ran for ~50 rounds due to early stopping.)

![Vertical XGBoost graph](./figs/vertical_xgboost_graph.png)
4 changes: 3 additions & 1 deletion examples/advanced/xgboost/README.md
@@ -139,7 +139,9 @@ By default, CPU based training is used.
If CUDA is installed on the site, tree construction and prediction can be
accelerated using GPUs.

To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.
To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.

### Multi GPU support

31 changes: 27 additions & 4 deletions examples/advanced/xgboost/histogram-based/README.md
@@ -11,15 +11,21 @@ Switch to this directory and install additional requirements (suggest to do this
python3 -m pip install -r requirements.txt
```

### Run centralized experiments
```
bash run_experiment_centralized.sh
```

### Run federated experiments with simulator locally
Next, we will use the NVFlare simulator to run FL training automatically.
```
bash run_experiment_simulator.sh
nvflare simulator jobs/higgs_2_histogram_v2_uniform_split_uniform_lr \
-w /tmp/nvflare/xgboost_v2_workspace -n 2 -t 2
```

### Run centralized experiments
Model accuracy can be visualized in tensorboard:
```
bash run_experiment_centralized.sh
tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
```

### Run federated experiments in real world
@@ -51,4 +57,21 @@ The custom executor can inherit the base class `FedXGBHistogramExecutor` and
overwrite the `xgb_train()` method.

To use another dataset, inherit the base class `XGBDataLoader` and
implement that `load_data()` method.
implement the `load_data()` method.
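
A minimal sketch of such a loader is shown below; the import path and the assumption that `load_data()` returns a train/validation `DMatrix` pair should both be checked against the NVFlare release in use:

```python
import pandas as pd
import xgboost as xgb

# Assumption: this import path matches the installed NVFlare release.
from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class CSVDataLoader(XGBDataLoader):
    """Hypothetical loader: one CSV per client, label in the first column."""

    def __init__(self, folder: str):
        self.folder = folder

    def load_data(self, client_id: str):
        df = pd.read_csv(f"{self.folder}/{client_id}.csv", header=None)
        dmat = xgb.DMatrix(df.iloc[:, 1:], label=df.iloc[:, 0])
        return dmat, dmat  # sketch reuses one split for train and validation
```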

## Loose integration

We can use the NVFlare controller/executor just to launch the external xgboost
federated server and client.

### Run federated experiments with simulator locally
Next, we will use the NVFlare simulator to run FL training automatically.
```
nvflare simulator jobs/higgs_2_histogram_uniform_split_uniform_lr \
-w /tmp/nvflare/xgboost_workspace -n 2 -t 2
```

Model accuracy can be visualized in tensorboard:
```
tensorboard --logdir /tmp/nvflare/xgboost_workspace/simulate_job/tb_events
```
@@ -13,6 +13,7 @@
"data_loader_id": "dataloader",
"num_rounds": "{num_rounds}",
"early_stopping_rounds": 2,
"metrics_writer_id": "metrics_writer",
"xgb_params": {
"max_depth": 8,
"eta": 0.1,
@@ -34,6 +35,16 @@
"args": {
"data_split_filename": "data_split.json"
}
},
{
"id": "metrics_writer",
"path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
"args": {"event_type": "analytix_log_stats"}
},
{
"id": "event_to_fed",
"name": "ConvertToFedEvent",
"args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
}
]
}
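
With this wiring, the executor resolves `metrics_writer_id` to the `TBWriter` component and streams XGBoost evaluation metrics through it during training; `ConvertToFedEvent` then turns each local `analytix_log_stats` event into a `fed.`-prefixed event that the server side can receive. A hedged sketch of the emitting pattern (the real callback inside the executor may differ):

```python
from nvflare.apis.analytix import AnalyticsDataType

# Hedged sketch: inside a component, look up the writer configured above.
# "fl_ctx" is the component's FLContext; the tag, value, and step are illustrative.
writer = fl_ctx.get_engine().get_component("metrics_writer")
round_num = 1
writer.write(tag="eval-auc", value=0.78, data_type=AnalyticsDataType.SCALAR, global_step=round_num)
```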
@@ -2,7 +2,15 @@
"format_version": 2,
"task_data_filters": [],
"task_result_filters": [],
"components": [],
"components": [
{
"id": "tb_receiver",
"path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
"args": {
"tb_folder": "tb_events"
}
}
],
"workflows": [
{
"id": "xgb_controller",
@@ -11,6 +11,7 @@
"path": "nvflare.app_opt.xgboost.histogram_based_v2.executor.FedXGBHistogramExecutor",
"args": {
"data_loader_id": "dataloader",
"metrics_writer_id": "metrics_writer",
"early_stopping_rounds": 2,
"xgb_params": {
"max_depth": 8,
@@ -33,6 +34,16 @@
"args": {
"data_split_filename": "data_split.json"
}
},
{
"id": "metrics_writer",
"path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
"args": {"event_type": "analytix_log_stats"}
},
{
"id": "event_to_fed",
"name": "ConvertToFedEvent",
"args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
}
]
}
@@ -3,7 +3,15 @@
"num_rounds": 100,
"task_data_filters": [],
"task_result_filters": [],
"components": [],
"components": [
{
"id": "tb_receiver",
"path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
"args": {
"tb_folder": "tb_events"
}
}
],
"workflows": [
{
"id": "xgb_controller",
@@ -138,7 +138,7 @@
"outputs": [],
"source": [
"%load_ext tensorboard\n",
"%tensorboard --logdir /tmp/nvflare/workspaces/xgboost_workspace_5_histogram_uniform_split_uniform_lr"
"%tensorboard --logdir /tmp/nvflare/workspaces/xgboost_workspace_5_histogram_uniform_split_uniform_lr/simulate_job/tb_events"
]
}
],
1 change: 0 additions & 1 deletion examples/advanced/xgboost/prepare_job_config.sh
@@ -1,5 +1,4 @@
#!/usr/bin/env bash
# change to "gpu_hist" for gpu training
TREE_METHOD="hist"

prepare_job_config() {
2 changes: 1 addition & 1 deletion examples/advanced/xgboost/tree-based/README.md
@@ -16,7 +16,7 @@ In addition to basic uniform shrinkage setting where all clients have the same l

## Run automated experiments
Please make sure to finish the [preparation steps](../README.md) before running the following steps.
To run all of the experiments in this example with NVFlare, follow the steps below. To try out a single experiment, follow this [notebook](./xgboost_tree_higgs.ipynb).
To run all experiments in this example with NVFlare, follow the steps below. To try out a single experiment, follow this [notebook](./xgboost_tree_higgs.ipynb).

### Environment Preparation

2 changes: 1 addition & 1 deletion examples/advanced/xgboost/utils/prepare_job_config.py
@@ -50,7 +50,7 @@ def job_config_args_parser():
parser.add_argument("--lr_mode", type=str, default="uniform", help="Whether to use uniform or scaled shrinkage")
parser.add_argument("--nthread", type=int, default=16, help="nthread for xgboost")
parser.add_argument(
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
"--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
)
return parser

19 changes: 18 additions & 1 deletion job_templates/vertical_xgb/config_fed_client.conf
@@ -7,7 +7,7 @@ executors = [
executor {
# Federated XGBoost Executor for histogram-base collaboration
id = "xgb_hist_executor"
name = "FedXGBHistogramExecutor"
path = "nvflare.app_opt.xgboost.histogram_based.executor.FedXGBHistogramExecutor"
args {
num_rounds = 100
early_stopping_rounds = 2
@@ -23,6 +23,8 @@ executors = [
data_loader_id = "dataloader"
# whether to enable GPU training
use_gpus = false
metrics_writer_id = "metrics_writer"
model_file_name = "test.model.json"
}
}
}
@@ -47,4 +49,19 @@ components = [
train_proportion = 0.8
}
}
{
id = "metrics_writer"
path = "nvflare.app_opt.tracking.tb.tb_writer.TBWriter"
args {
event_type = "analytix_log_stats"
}
}
{
id = "event_to_fed"
name = "ConvertToFedEvent"
args {
events_to_convert = ["analytix_log_stats"]
fed_event_prefix = "fed."
}
}
]
13 changes: 9 additions & 4 deletions job_templates/vertical_xgb/config_fed_server.conf
@@ -1,7 +1,4 @@
format_version = 2
server {
heart_beat_timeout = 600
}
task_data_filters = []
task_result_filters = []
workflows = [
@@ -13,4 +10,12 @@ workflows = [
}
}
]
components = []
components = [
{
id = "tb_receiver"
path = "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver"
args {
tb_folder = tb_events
}
}
]
19 changes: 19 additions & 0 deletions nvflare/app_common/tracking/log_writer.py
@@ -13,7 +13,9 @@
# limitations under the License.

from abc import ABC, abstractmethod
from typing import Optional

from nvflare.apis.analytix import AnalyticsDataType
from nvflare.apis.event_type import EventType
from nvflare.apis.fl_component import FLComponent
from nvflare.apis.fl_context import FLContext
@@ -41,6 +43,23 @@ def handle_event(self, event_type: str, fl_ctx: FLContext):
self.sender = AnalyticsSender(self.event_type, self.get_writer_name())
self.sender.engine = engine

def write(self, tag: str, value, data_type: AnalyticsDataType, global_step: Optional[int] = None, **kwargs):
"""Writes a record.
Args:
tag (str): Tag name
value: Value to send
data_type (AnalyticsDataType): Data type of the value being sent
global_step (optional, int): Global step value.
Raises:
TypeError: global_step must be an int
"""
self.sender.add(tag=tag, value=value, data_type=data_type, global_step=global_step, **kwargs)

@abstractmethod
def get_writer_name(self) -> LogWriterName:
pass

def get_default_metric_data_type(self) -> AnalyticsDataType:
return AnalyticsDataType.METRICS
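
For context, a `LogWriter` subclass only needs to supply `get_writer_name()` and can layer convenience methods on top of `write()`. A hedged sketch follows; the `LogWriterName` import path and enum value are assumptions to verify against the installed release:

```python
from nvflare.apis.analytix import AnalyticsDataType
from nvflare.app_common.tracking.log_writer import LogWriter
from nvflare.app_common.tracking.tracker_types import LogWriterName


class ScalarWriter(LogWriter):
    """Hypothetical writer that logs scalar metrics via the base write()."""

    def get_writer_name(self) -> LogWriterName:
        return LogWriterName.TORCH_TB  # assumption: TensorBoard-style identity

    def log_scalar(self, tag: str, value: float, step: int):
        self.write(tag=tag, value=value, data_type=AnalyticsDataType.SCALAR, global_step=step)
```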
