This repository contains the source code of our research article
Privacy Preserving Federated Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data
Please refer to FREDA-CV for a simplified, general-purpose implementation that replaces domain similarity with cross-validation.
- Python 3.8.18
To install the required Python libraries, run:
pip install -r requirements.txt

The following arguments can be configured when running the main.py script:
| Argument | Description | Default Value |
|---|---|---|
| --setup | Number of source clients to simulate. | 2 |
| --dist | Distribution identifier for the experiment. | 0 |
| --use_precomputed_confs | Whether to use precomputed confidence scores. | True |
| --use_precomputed_lambdas | Whether to use precomputed optimal lambdas. | True |
| --lambda_path | Path to a text file containing lambda values. If not provided, default values are used. | None |
| --home_path | Root directory for the project. Can be set to any desired path. | Current directory |
| --alpha | Weighting factor for the loss function. | 0.8 |
| --epochs | Number of local training epochs. | 20 |
| --global_iterations | Number of global iterations. | 100 |
| --lr_init | Initial learning rate. | 0.0001 |
| --lr_final | Final learning rate. | 0.00001 |
| --k_value | Exponent of the weight function for transforming confidences into weights. | 3 |
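
For scripted or repeated runs, the arguments in the table can be collected in a dictionary and turned into a command line. The following is a minimal sketch and not part of this repository: the helper `build_command` is a hypothetical name, only the numeric options are included, and it assumes main.py accepts each of them in the form `--name value`.

```python
import shlex
import sys

# Default values copied from the argument table (numeric options only).
DEFAULTS = {
    "setup": 2,
    "dist": 0,
    "alpha": 0.8,
    "epochs": 20,
    "global_iterations": 100,
    "lr_init": 0.0001,
    "lr_final": 0.00001,
    "k_value": 3,
}


def build_command(overrides=None, script="main.py"):
    """Return an argument list for running main.py (hypothetical helper)."""
    settings = {**DEFAULTS, **(overrides or {})}
    command = [sys.executable, script]
    for name, value in settings.items():
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    cmd = build_command({"epochs": 5, "k_value": 2})
    # Print the command so it can be inspected before launching it.
    print(shlex.join(cmd))
    # To launch the run instead: subprocess.run(cmd, check=True)
```

Keeping the defaults in one place makes it easy to vary a single option, such as `k_value`, across several runs while leaving the rest unchanged.
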
Here’s an example of how to run the experiment with sample arguments:
python main.py --setup 2 --dist 0 --use_precomputed_confs False --use_precomputed_lambdas False --lambda_path ./lambdas.txt --home_path ./FREDA/ --alpha 0.8 --epochs 20 --global_iterations 100 --lr_init 0.0001 --lr_final 0.00001 --k_value 3

We utilized DNA methylation data and donor age information from two main sources:
- The Cancer Genome Atlas (TCGA)
  - Reference: Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M.: The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10), 1113–1120 (2013).
    DOI: 10.1038/ng.2764
- The Gene Expression Omnibus (GEO)
  - Reference: Edgar, R., Domrachev, M., Lash, A.E.: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207–210 (2002).
    DOI: 10.1093/nar/30.1.207

Alternatively, the tissue data used in the experiments can be accessed from the following repository (https://github.com/greenelab/wenda_gpu_paper.git) under data/handl. To run the experiments, users need to download the source and target data and their labels, as well as the phenotypes for both the source and target domains, and place all the files inside the empty dna_data directory. Partitioned datasets for the experiments can then be generated by running prep_source_data.py and prep_target_data.py.
To train the final adaptive models, users also need to place the translated tissue similarities, in CSV format, inside a folder named tissueSimilarityFromNaturePaper/ in the working directory.
The tissue similarities we used were translated from the data in the following paper:
- Reference: Aguet, F. et al.: Genetic effects on gene expression across human tissues. Nature, 550, 204–213 (2017).
  DOI: 10.1038/nature24277
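
Before starting a run, it can help to confirm that the files described above are in place. The following is a minimal sketch and not part of this repository: the function `check_layout` is a hypothetical name, and it assumes that dna_data, tissueSimilarityFromNaturePaper/ and the two preparation scripts all sit directly under the working directory and that the similarity files use a `.csv` extension.

```python
from pathlib import Path


def check_layout(root="."):
    """Return a list of problems found in the expected directory layout."""
    root = Path(root)
    problems = []

    # dna_data must exist and contain the downloaded data files.
    data_dir = root / "dna_data"
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        problems.append("dna_data is missing or empty")

    # The tissue similarities are expected as CSV files (assumed *.csv).
    sim_dir = root / "tissueSimilarityFromNaturePaper"
    if not sim_dir.is_dir() or not list(sim_dir.glob("*.csv")):
        problems.append("tissueSimilarityFromNaturePaper has no CSV files")

    # The preparation scripts are assumed to live in the same directory.
    for script in ("prep_source_data.py", "prep_target_data.py"):
        if not (root / script).is_file():
            problems.append(f"{script} not found")

    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print("WARNING:", problem)
```

An empty list means the layout matches these assumptions; it does not verify the contents of the files themselves.
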