Questions about active learning with relative label problem #13
Dear Dr. Walters and community,
I appreciate the detailed tutorial on developing an active learning classifier. However, I have some concerns regarding the feasibility of using a simple oracle, such as computational methods like docking or binding free energy (BFE) calculations, in prospective drug discovery campaigns where experimental validation isn’t pre-conducted.
Challenge of Labeling Virtual Actives
One issue arises when, in the first iteration, 100K compounds are docked and 1K are labeled as virtual actives. In subsequent iterations, we predict the labels of the remaining compounds and likely select the next 100K compounds with the highest probability of being active for further docking. This process can lead to a situation where the initial 1K compounds may change their labels from active (1) to inactive (0), introducing bias and potentially compromising the model’s integrity.
Another challenge occurs with uncertainty sampling. In this method, the least confident compounds are included in the next iteration, expanding the training set from 100K to 200K compounds. This change impacts the initial percentile calculation, possibly leading to the reclassification of compounds from inactive to active, which can disrupt the model and its predictions.
Given these challenges, I am seeking insights or suggestions on how to manage the dynamic changes in compound labels across iterations to maintain model fairness and accuracy. Specifically:
How can we handle dynamic label changes to ensure the model remains unbiased and accurate?
Best regards,
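To make the label-shift concern concrete, here is a minimal sketch of how a percentile cutoff moves once a second, enriched batch is pooled with the first. Everything in it is an assumption for illustration: the scores are simulated Gaussian values rather than real docking results, lower scores are taken to be better, the batches are 1,000 compounds instead of 100K, and `percentile_threshold` and `label` are hypothetical helper names.

```python
import random

def percentile_threshold(scores, top_fraction):
    """Cutoff such that the best `top_fraction` of scores count as active.
    Assumes lower (more negative) scores are better."""
    ranked = sorted(scores)
    n_active = max(1, round(len(ranked) * top_fraction))
    return ranked[n_active - 1]

def label(scores, threshold):
    """1 = virtual active, 0 = virtual inactive."""
    return [1 if s <= threshold else 0 for s in scores]

rng = random.Random(0)
# Iteration 1: a random batch of simulated scores (not real docking data).
batch1 = [rng.gauss(-6.0, 1.0) for _ in range(1000)]
# Iteration 2: a batch enriched in good scores, as greedy selection would give.
batch2 = [rng.gauss(-7.0, 1.0) for _ in range(1000)]

# Labels for batch 1 using only batch 1 (top 1%).
t1 = percentile_threshold(batch1, 0.01)
labels_before = label(batch1, t1)

# Labels for the same compounds after pooling both batches (top 1% of 2000).
t2 = percentile_threshold(batch1 + batch2, 0.01)
labels_after = label(batch1, t2)

# Batch 1 compounds that went from active (1) to inactive (0).
flipped = sum(b == 1 and a == 0 for b, a in zip(labels_before, labels_after))
print(f"cutoff moved from {t1:.2f} to {t2:.2f}; {flipped} labels flipped 1 -> 0")
```

The raw scores of the first batch never change in this sketch; only the cutoff derived from the pooled scores does, which is where the flipped labels come from.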
Replies: 1 comment
Hi Phong,
Thanks for your detailed question. For virtual screening, I always use a regression model, so I don't have to worry about the threshold shifts you describe. However, even with a classification model, I don't think the issue you mention will be a factor. Relabeling specific instances shouldn't be a problem if a model is retrained at each active learning iteration.
Best,
Pat
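As a rough illustration of the two points in this reply (regress on the raw scores, and retrain at every iteration), the loop might be sketched as below. This is not the tutorial's code: `fit_line`, `toy_oracle`, and `run_active_learning` are hypothetical names, the "compounds" are single random numbers standing in for descriptors, the oracle is a made-up function where lower is better, and a one-variable least-squares line stands in for a real regressor.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares line through (x, y); a toy stand-in for a real regressor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def toy_oracle(x):
    """Made-up 'docking score' for a one-number descriptor; lower is better."""
    return 2.0 * x - 8.0 + 0.5 * math.sin(40.0 * x)

def run_active_learning(pool, oracle, n_iter=3, batch_size=50, seed=0):
    rng = random.Random(seed)
    unscored = list(range(len(pool)))
    rng.shuffle(unscored)
    # Raw oracle scores are stored once per compound and never relabeled.
    scores = {}
    picks, unscored = unscored[:batch_size], unscored[batch_size:]
    for i in picks:
        scores[i] = oracle(pool[i])
    for _ in range(n_iter):
        # Retrain from scratch on every compound scored so far.
        idx = list(scores)
        model = fit_line([pool[i] for i in idx], [scores[i] for i in idx])
        # Greedy selection: rank the unscored compounds by predicted score.
        unscored.sort(key=lambda i: model(pool[i]))
        picks, unscored = unscored[:batch_size], unscored[batch_size:]
        for i in picks:
            scores[i] = oracle(pool[i])
    return scores

rng = random.Random(1)
pool = [rng.random() for _ in range(1000)]
scores = run_active_learning(pool, toy_oracle)
print(f"scored {len(scores)} of {len(pool)} compounds")
```

Because the model is fit to the scores themselves, no active/inactive cutoff appears anywhere in the loop. With a classifier, one option would be to derive the labels from the stored scores immediately before each retraining step, so the model always sees one consistent labeling of the current training set.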