Regina Ye
Instructions for scripts.
This project is an implementation of performer neural networks to analyze Gene-Environment (GxE) interactions and their impact on phenotypic predictions. It's based on the code developed by Måløy et al., available at https://github.com/haakom/pay-attention-to-genomic-selection.
- 10-fold Cross-Validation
- Hyperparameter optimization with Optuna
- Support for SLURM-based cluster optimization
- Customizable settings for dataset handling and model training
To install the necessary components for this project, ensure that you have Anaconda installed. To clone the repository and set up the environment:
-
Clone the repository: git clone https://github.com/harpak-lab/g-x-e-model-selection.git
-
Navigate to the source directory: cd g-x-e-model-selection/src
-
Create the Anaconda environment from the
environment.yml
file: conda env create -f environment.yml
After creating the environment, activate it using conda activate GxE_predict_phenotypes
You will also need to manually install (cat-bgen), (plink), and (plink2).
Phenotype files are obtained from the UK Biobank and renamed pheno_(phenotype code).txt.
Imputed data for the UK Biobank was downloaded using the (ukbgene utility) in the Oxford bgen format.
Context files are obtained from the UK Biobank and renamed wc.covariates.txt
Before running the main model, it's essential to preprocess your data using the provided script.
-
Locate the Preprocessing Script and Configuration File: Find
preprocess_data.sh
andconfig.sh
in the prep_data directory. -
Provide File Paths in the
config.sh
File: Provide the paths to your data files in the config.sh file. Ensure that your .bgen files have the naming scheme ukb_imp_chr${CHR}_v3.bgen and the .sample files have the naming scheme ukb61666_imp_chr${CHR}_v3_s487280.sample. -
Run the Preprocessing Script:
Execute the preprocessing script:
./preprocess_data.py
This script may require several hours to finish executing.
To run the project:
python main.py
--hyper_search
: Set to 1 for hyperparameter search.--total_num_trials
: Number of trials for the study.--database_name
: Path to the hyperparameter search database.--datasets_dir
: Directory where datasets are stored.--dataset
: Name of the dataset to use.--inverse_transform
: Path to the inverse transform of the target data.--standardize_genome
: Set to 1 to standardize the genome data.--batch_size
: Batch size for training.--num_epochs
: Number of epochs for training.- More arguments can be configured in the script.
For example, in the src folder, you can run:
python main.py \
--total_num_trials 3 \
--database_name ../hyperparameter_search_database.sqlite \
--datasets_dir ../dataset/ \
--dataset genome_HDL_phenotype_context_data.lmdb \
--inverse_transform HDL_phenotype_scaler.joblib \
--num_hparams_explor 10 \
--batch_size 50 \
--gene_length 3000 \
--gene_size 4 \
--num_nodes 3 \
--num_epochs 10 \
--gpus 0 \
This project is based on the work of ([Måløy et al.] (https://github.com/haakom/pay-attention-to-genomic-selection)) Please refer to their repository for the original implementation. Parts of the project use code written by ([Phil Wang] (https://github.com/lucidrains/nystrom-attention)). Parts of this README and the QC.sh file are taken from ([Zhu et al.'s] (https://github.com/harpak-lab/amplification_gxsex) )work.