A supervised learning framework to predict Biosynthetic Gene Clusters (BGCs) in fungi based on a combination of feature types (k-mers, Pfam protein domains, and GO terms).
Make a copy of /src/config.init.DEFAULT, and rename it to /src/config.init. Update the [default] home to the current project root path.
At the [prediction] section in the config.init file, specify the minimum parameters accordingly:
- the task: train, validation, or test
- indicate the corpus location in source.path
- (if using sequences) indicate the source.type: nucleotide or aminoacid
- specify the positive instances % in pos.perc
- indicate the feat.type as kmers, domains or go (if combining multiple features, separate them with a -, as in go-kmers-domains)
- set the minimum occurrences to consider a feature in feat.minOcc
- set the k-mer length in feat.size
- select a classifier: logit, mlp, linearsvc, nusvc, svc, randomforest
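The parameters above can be sanity-checked before launching a run. The following is a minimal sketch, not part of the project: the section and key names come from the list above, but every value in EXAMPLE_CONFIG, the paths, and the helper name check_prediction_section are hypothetical, and the project may parse its config differently.

```python
import configparser

# Hypothetical example values; only the section and key names come from the README.
EXAMPLE_CONFIG = """
[default]
home = /path/to/project/root

[prediction]
task = train
source.path = /path/to/corpus
source.type = aminoacid
pos.perc = 50
feat.type = kmers-domains
feat.minOcc = 5
feat.size = 6
classifier = randomforest
"""

# Allowed values as listed in the README.
ALLOWED = {
    "task": {"train", "validation", "test"},
    "source.type": {"nucleotide", "aminoacid"},
    "classifier": {"logit", "mlp", "linearsvc", "nusvc", "svc", "randomforest"},
}
FEATURE_TYPES = {"kmers", "domains", "go"}


def check_prediction_section(text):
    """Return a list of problems found in the [prediction] section (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["prediction"]
    problems = []
    # Note: source.type is only needed when using sequences; this sketch always checks it.
    for key, allowed in ALLOWED.items():
        value = section.get(key, "")
        if value not in allowed:
            problems.append(f"{key}: unexpected value {value!r}")
    # Combined feature types are separated with "-", e.g. go-kmers-domains.
    for feat in section.get("feat.type", "").split("-"):
        if feat not in FEATURE_TYPES:
            problems.append(f"feat.type: unknown feature {feat!r}")
    return problems


print(check_prediction_section(EXAMPLE_CONFIG))
```

An empty list means every checked value is one of the options listed above; numeric fields such as pos.perc, feat.minOcc and feat.size are not validated here because the README does not state their accepted ranges.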
To run the classification task from the project virtualenv, simply run:
(.env) user@foo:~fungalbgcs/src$ python -m pipeprediction.ML
The train task will generate a /metrics folder, with:
- the (re-load-able) model file (classifier)_(featuretype).model.pkl
- a feature list file (featuretype).feat
The validation task will also generate in the /metrics folder:
- a performance file (classifier)_(featuretype).valid with P, R, F-m and a confusion matrix
- a list of {valid_instance_IDs, predicted label} file (classifier)_(featuretype).IDs.valid
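For readers unfamiliar with the reported scores, the sketch below shows how precision (P), recall (R) and F-measure (F-m) follow from a binary confusion matrix. It is an illustration only: it assumes F-m is the balanced F1 score and that labels are 1 for BGC and 0 for non-BGC, neither of which is stated in this README, and the function name binary_metrics is hypothetical.

```python
def binary_metrics(gold, predicted, positive=1):
    """Compute confusion-matrix counts and P, R, F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up labels for illustration: 1 = BGC, 0 = non-BGC (assumed encoding).
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(gold, pred))
```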
The test task requires either train or validation to have been performed, since it will read from the model *.model.pkl and feature *.feat files. It generates in the /metrics folder:
- a performance file (classifier)_(featuretype).test with P, R, F-m and a confusion matrix
- a list of {test_instance_IDs, predicted label} file (classifier)_(featuretype)_(testfolder).IDs.test, used as input for evaluation against gold clusters
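The naming patterns for the three tasks can be summarized in a small helper. This is a sketch under assumptions: the function expected_outputs and the relative metrics directory are hypothetical, and it assumes that a combined feature type (for example go-kmers-domains) appears verbatim in the file names, which this README does not confirm.

```python
from pathlib import PurePosixPath


def expected_outputs(classifier, feat_type, test_folder=None, metrics_dir="metrics"):
    """Build the output file names described in the README (hypothetical helper)."""
    base = f"{classifier}_{feat_type}"
    root = PurePosixPath(metrics_dir)  # location of /metrics is an assumption
    files = {
        # train task
        "model": root / f"{base}.model.pkl",
        "features": root / f"{feat_type}.feat",
        # validation task
        "validation_performance": root / f"{base}.valid",
        "validation_ids": root / f"{base}.IDs.valid",
        # test task
        "test_performance": root / f"{base}.test",
    }
    if test_folder is not None:
        files["test_ids"] = root / f"{base}_{test_folder}.IDs.test"
    return {name: str(path) for name, path in files.items()}


# Hypothetical classifier, feature type and test folder name.
for name, path in expected_outputs("randomforest", "kmers", test_folder="mytest").items():
    print(f"{name}: {path}")
```

Such a helper can be handy for checking that the model and feature files exist before launching the test task, since that task reads both.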
Datasets: Openly available fungal BGC datasets to train and validate models (details here).
External software: To set up Pfam for protein domain annotation locally, please refer to the steps on /extSoftware/.