Questions about the data used for ML models training #5

robertoimuno · 2024-02-23T12:12:31Z

Hi,

I hope this message finds you well. My name is Roberto, and I've been reading your paper recently, a very interesting study.

I have a few questions about how the NGS files containing antibodies categorized as binders and non-binders are used. The 'scripts/main.py' file of the repository contains the following lines of code:

# Load non-binding sequences
ab_neg_files = [
    'mHER_H3_1_Ab.txt', 'mHER_H3_1_AgN.txt',
    'mHER_H3_2_Ab.txt', 'mHER_H3_2_AgN.txt',
    'mHER_H3_3_Ab.txt', 'mHER_H3_3_AgN.txt'
]
mHER_H3_AgNeg = load_input_data(ab_neg_files, Ag_class=0)
# Load binding sequences
ab_pos_files = [
    'mHER_H3_1_2Ag647.txt', 'mHER_H3_1_2Ag488.txt',
    'mHER_H3_2_2Ag647.txt', 'mHER_H3_2_2Ag488.txt',
    'mHER_H3_3_2Ag647.txt', 'mHER_H3_3_2Ag488.txt'
]
mHER_H3_AgPos = load_input_data(ab_pos_files, Ag_class=1)

If I understand correctly, these lines of code read the NGS-processed files for binders and non-binders from three distinct DMS enrichment rounds and merge them into two files: 'data/mHER_H3_AgNeg.csv' and 'data/mHER_H3_AgPos.csv'. In a later step, these files are concatenated and the class ratio is adjusted. During this adjustment, a subset of non-binders is removed, and it is then reintroduced incrementally into the training set.

The training set is made incrementally more imbalanced only when benchmarking the different ML algorithms. In the later step, where only the CNN is trained, the training set is not incremented.
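To make sure we are talking about the same procedure, here is a minimal sketch of the incremental scheme as I understand it. It is not the repository's code: the function name `incremental_training_sets`, the `steps` parameter, and the assumption that the starting set is balanced (as many non-binders as binders) are all hypothetical.

```python
import random


def incremental_training_sets(binders, non_binders, steps, seed=0):
    """Yield labeled training sets that gain non-binders at each step.

    Hypothetical illustration: the first set is assumed to be balanced,
    and the held-out non-binders are added back in `steps` equal chunks.
    `steps` must be at least 1.
    """
    rng = random.Random(seed)
    pool = list(non_binders)
    rng.shuffle(pool)

    n_pos = len(binders)
    base, held_out = pool[:n_pos], pool[n_pos:]

    # Ceiling division so the last step includes every held-out non-binder.
    chunk = -(-len(held_out) // steps) if held_out else 0

    for i in range(steps + 1):
        negatives = base + held_out[: i * chunk]
        yield [(s, 1) for s in binders] + [(s, 0) for s in negatives]


# Toy placeholders, not real sequences.
sets = list(incremental_training_sets(["A", "B"], ["C", "D", "E", "F", "G", "H"], steps=2))
negative_counts = [sum(1 for _, label in s if label == 0) for s in sets]
print(negative_counts)
```

In this toy run the number of binders stays fixed while the number of non-binders grows from step to step, which is the behaviour I am assuming for the benchmarking datasets.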

Looking into the incremented datasets used for model training, I observed that every set except the one with a binders-to-non-binders ratio of 0.5 contains instances where the same CDR3H sequence appears with different labels. I guess this happens because an antibody can be a binder in the results from one DMS enrichment round (for instance, the first) but a non-binder in another (for instance, the last).
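A check along the following lines can surface such sequences. This is a minimal sketch, not code from the repository: the helper name `find_conflicting_labels` and the toy sequences are made up for illustration, and the input is assumed to be (sequence, label) pairs.

```python
from collections import defaultdict


def find_conflicting_labels(records):
    """Return sequences that appear with more than one label.

    `records` is an iterable of (sequence, label) pairs.
    """
    labels_by_seq = defaultdict(set)
    for seq, label in records:
        labels_by_seq[seq].add(label)
    return {
        seq: sorted(labels)
        for seq, labels in labels_by_seq.items()
        if len(labels) > 1
    }


# Made-up sequences for illustration only (not taken from the dataset).
toy = [
    ("WGGDGFYAMD", 1),
    ("WGGDGFYAMD", 0),  # same sequence, opposite label
    ("WRRNGFYVMD", 0),
    ("WRRNGFYVMD", 0),  # exact duplicate, same label: not a conflict
    ("WSSDGLYAFD", 1),
]
conflicts = find_conflicting_labels(toy)
print(conflicts)
```

Only sequences that carry both labels are reported; exact duplicates with a consistent label are ignored.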

This is an image showing an example of two sequences in which this happens:

- Is this observation correct? Does this labeling duality serve a specific purpose in the training process? If so, could you shed light on the benefits of adopting this approach when benchmarking the ML algorithms?

- Could you share your perspective on the advantages of incorporating data from different rounds of DMS enrichment in the training set?

Thank you in advance for your time and insights.
Kind regards,
Roberto
