readme

New analysis pipeline for high-throughput domain-peptide affinity experiments improves SH2 interaction data
Tom Ronan, Roman Garnett, and Kristen Naegle


Note on dependencies: Files and data are located in /externaldata /ms_func_lib and /origdata subdirectories.

Analysis for the manuscript was performed as follows:

Running:

    python 01_load_and_convert_raw_data_from_expt_format.py

reads the raw data from the original plate format as supplied by the original authors, and produces a flat-table
version of the original raw data with minimal changes made (see comments in code). The file produced is:

    01_rawdata_loaded_from_expt_format.csv

Researchers wanting to do their own independent analysis of the raw measurements can start with this, more conveniently
formatted version of the raw data.

Next, the raw measurements are grouped, and curve fitting and model selection are performed as described in the
manuscript. To perform that step, run (run time is approximately 21 minutes or more):

    python 02_Kd_curve_fitting_and_model_selection.py

which uses the '01_rawdata_loaded_from_expt_format.csv' file as input. The output of curve fitting and model selection
is found here:

    02_fitted_Kd_and_model_selection_data.csv

Finally, the steps of protein functionality analysis and turning multiple replicates into a single reported affinity is
performed in the following code:

    python 03_replicate_and_functionality_analysis.py

which uses the '02_fitted_Kd_and_model_selection_data.csv' file as input. The output of this code is found in two files:

    03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv
    03b_grouped_data_including_published.csv

The first file, '03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv', shows the results at
the replicate level after protein functionality has been evaluated. The second file,
'03b_grouped_data_including_published.csv', contains the final results at the domain-peptide level.


-------------------------

The file '01_rawdata_loaded_from_expt_format.csv', is a conversion of the plate-style raw data format from the original raw data to a flat-table style format, with no additional analysis. Each row in the data represents a single well on the original plate. The data file contains the following columns:

	filename: The filename from the original publication raw data for this row 
	plate_number: The 384-well plate identification number from the original publication raw data
	peptide_number: the id number of the peptide from the original publication master peptide list (found at
		'/origdata/Jones FP master peptide list edited.csv')
	peptide_description: A text description of the peptide from the original publication raw data.
	domain_name: the name of the protein domain tested against the peptide on the plate
	domain_pos: the position on the plate of the domain (relevant in the cases where the same domain was tested more than
		once on the plate)
	domain_conc: the concentration of the domain in the specific well
	FP: the FP value
	bg: a background value common to the entire plate

	Note: There are 12 concentration measurements per domain peptide pair.
	In order to identify the unique domain-peptide pair tested, the filename,
	the plate_number, and the peptide number, plus the the domain and the domain
	position are needed to provide a unique identifier for an experimental interaction.

The file '02_fitted_Kd_and_model_selection_data.csv', contains the 12 measurements for each measurement compressed into one row. Each row then reprsents one full affinity measurement, with the 12 concentrations and 12 FP values. In addition, each row contains results from all models fitted, plus the model selected. The data file contains the following columns:

	domain: protein domain
	gene_name: peptide gene source
	pY_pos: phosphotyrosine position in the original source of the peptide (which, when combined with the gene name, results
		in the common phosphopeptide descriptor, eg. EFGR pY 1172)
	peptide_number: the id number of the peptide from the original publication master peptide list (found at
		'/origdata/Jones FP master peptide list edited.csv')
	peptide_description: A text description of the peptide from the original publication raw data.
	plate_number: The 384-well plate identification number from the original publication raw data
	filename: The filename from the original publication raw data for this row 
	domain_pos: the position on the plate of the domain (relevant in the cases where the same domain was tested more than
		once on the plate)
	bg: a background value common to the entire plate
	pepConcTxt: a text field concatenating the peptide concentration list
	FluorTxt: a text field concatenating the FP value list for each of the concentrations
	PeptideConc: a non-text field concatenating the peptide concentration list (for use in the analysis script)
	Fluorescence: a non-text field concatenating the FP value list for each of the concentrations (for use in the analysis script)
	linear_slope: the fitted slope of the linear model fit
	linear_offset: the fitted offset (in FP units) of the linear model fit
	linear_rsq: the r-squared calculation for the linear fit
	linear_AICc: the AICc score for the linear fit
	linear_resVar: variance of the residuals for the linear fit
	linear_success: a boolean indicator of linear fit success
	fo_Fmax: the fitted Fmax (saturation) for the first-order model fit
	fo_Kd: the fitted affinity (Kd, in uM) of the first-order model fit
	fo_offset: the fitted offset (in FP units) of the first-order model fit
	fo_rsq: the r-squared calculation for the first-order fit
	fo_AICc: the AICc score for the first-order fit
	fo_resVar: variance of the residuals for the  the first-order fit
	fo_success: a boolean indicator of first-order fit success
	peptide_str: a text string representing the peptide sequence
	pep_seq_cleaned: an alternate version of the text string representing the peptide sequence, without special characters
	pep_seq_10mer_aligned: an aligned (to the pY) 10-mer representation of the peptide string for easy peptide comparison
	domain_seq_align_gapless: the SH2 protein domain sequence from an internally generated alignment (gaps removed)
	simJones_Kd: internal use fields for comparison to the original publication
	simJones_Fmax: internal use fields for comparison to the original publication
	simJones_offset: internal use fields for comparison to the original publication
	simJones_rsq: internal use fields for comparison to the original publication
	simJones_binding_call: internal use fields for comparison to the original publication
	fit_method: the selected model fit (linear or first-order) based on the lowest AICc score
	fit_slope: the fitted slope of the selected fit (if available)
	fit_Kd: the fitted affinity (Kd) of the selected fit (if available)
	fit_Fmax: the fitted Fmax (saturation) for the selected fit (if available)
	fit_offset: the fitted offset (in FP units) of the selected fit
	fit_rsq: the r-squared calculation for the selected fit
	fit_AICc: the AICc score for the selected fit
	fit_resVar: variance of the residuals for the selected fit
	fit_success: a boolean indicator of the selected fit success
	fit_residuals: a text field concatenating the residuals at each point of the selected fit
	fit_modelDNR: the dynamic range of the selected fit (Fmax-offset)
	fit_sumres: the sum of the residuals for the selected fit
	fit_snr: the fit signal to noise ratio calculated by fit_sumres/fit_modelDNR
	fit_dnr: the dynamic range of the FP data (max FP-min FP)
	binding_call: the final categorization of the fit after model selection and additional evaluation


The file '03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv' contains the same data as '02_fitted_Kd_and_model_selection_data.csv', but with the additional fields relating to functional protein identification:

	binder_in_rep_grp: indicates whether there was any binder in the replicate group for a domain-peptide
	protein_non_functional: indicates whether the domain protein tested in this measurement is likely to be non-functional
	binding_call_functionality_evaluated: indicates the revised call for this measurement based on the protein functionality assessment


The file '03b_grouped_data_including_published.csv' contains the final affinity results of our revised analysis after replicate group evaluation, as well as the original published affinity. Fields are as above, plus:


	count_replicates: total count of all replicate measurements for a given domain-peptide pair
	count_bind: count of replicates categorized as binders
	count_nobind: count of replicates categorized as non-binders
	count_agg: count of replicates categorized as aggregators
	count_lowSNR: count of replicates categorized as low-SNR interactions
	count_nonfunctional: count of replicates for which categorization changed to non-functional after protein functionality evaluation
	pub_Kd_mean: the published mean affinity (Kd, in uM)
	pub_num_replicates: the number of replicates indicated in the original publication (represents the number before certain
		filtering processess were performed in the original analysis, use with caution)