New analysis pipeline for high-throughput domain-peptide affinity experiments improves SH2 interaction data
Tom Ronan, Roman Garnett, and Kristen Naegle
Note on dependencies: Files and data are located in the /externaldata, /ms_func_lib, and /origdata subdirectories.
Analysis for the manuscript was performed as follows:
Running:
python 01_load_and_convert_raw_data_from_expt_format.py
reads the raw data from the original plate format as supplied by the original authors, and produces a flat-table
version of the original raw data with minimal changes made (see comments in code). The file produced is:
01_rawdata_loaded_from_expt_format.csv
Researchers wanting to do their own independent analysis of the raw measurements can start with this more conveniently
formatted version of the raw data.
Next, the raw measurements are grouped, and curve fitting and model selection are performed as described in the
manuscript. To perform that step, run (run time is approximately 21 minutes or more):
python 02_Kd_curve_fitting_and_model_selection.py
which uses the '01_rawdata_loaded_from_expt_format.csv' file as input. The output of curve fitting and model selection
is found here:
02_fitted_Kd_and_model_selection_data.csv
Finally, protein functionality analysis and the consolidation of multiple replicates into a single reported affinity are
performed by the following code:
python 03_replicate_and_functionality_analysis.py
which uses the '02_fitted_Kd_and_model_selection_data.csv' file as input. The output of this code is found in two files:
03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv
03b_grouped_data_including_published.csv
The first file, '03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv', shows the results at
the replicate level after protein functionality has been evaluated. The second file,
'03b_grouped_data_including_published.csv', contains the final results at the domain-peptide level.
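The three steps above can also be chained from a single driver script. The following is a minimal sketch and is not part of the repository: the names PIPELINE, check_chain, and run_pipeline are hypothetical, and it assumes the scripts are run from the directory that contains them and write their output files there.

```python
import subprocess
import sys
from pathlib import Path

# (script, required input file or None, output files), in the order given in this readme.
PIPELINE = [
    ("01_load_and_convert_raw_data_from_expt_format.py",
     None,
     ["01_rawdata_loaded_from_expt_format.csv"]),
    ("02_Kd_curve_fitting_and_model_selection.py",
     "01_rawdata_loaded_from_expt_format.csv",
     ["02_fitted_Kd_and_model_selection_data.csv"]),
    ("03_replicate_and_functionality_analysis.py",
     "02_fitted_Kd_and_model_selection_data.csv",
     ["03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv",
      "03b_grouped_data_including_published.csv"]),
]


def check_chain(pipeline=PIPELINE):
    """Return True if every step's input is produced by an earlier step."""
    produced = set()
    for _script, required, outputs in pipeline:
        if required is not None and required not in produced:
            return False
        produced.update(outputs)
    return True


def run_pipeline(workdir="."):
    """Run each script in order, stopping at the first failure."""
    workdir = Path(workdir)
    for script, required, _outputs in PIPELINE:
        # Assumption: each step reads its input from the working directory.
        if required is not None and not (workdir / required).exists():
            raise FileNotFoundError(f"{required} is missing; run the previous step first")
        subprocess.run([sys.executable, script], cwd=workdir, check=True)
```

Calling run_pipeline() from the repository directory would run all three steps in sequence; step 02 alone takes roughly 21 minutes or more.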
-------------------------
The file '01_rawdata_loaded_from_expt_format.csv' is a conversion of the original plate-style raw data to a flat-table format, with no additional analysis. Each row in the data represents a single well on the original plate. The data file contains the following columns:
filename: The filename from the original publication raw data for this row
plate_number: The 384-well plate identification number from the original publication raw data
peptide_number: the id number of the peptide from the original publication master peptide list (found at
'/origdata/Jones FP master peptide list edited.csv')
peptide_description: A text description of the peptide from the original publication raw data.
domain_name: the name of the protein domain tested against the peptide on the plate
domain_pos: the position on the plate of the domain (relevant in the cases where the same domain was tested more than
once on the plate)
domain_conc: the concentration of the domain in the specific well
FP: the FP value
bg: a background value common to the entire plate
Note: There are 12 concentration measurements per domain-peptide pair.
To identify a unique experimental interaction, the filename, the
plate_number, and the peptide_number, plus the domain_name and the
domain_pos, are all needed together.
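As an illustration of the note above, the sketch below groups the well-level rows into individual measurements using those five identifying columns. It is a minimal example using only the Python standard library, not the code used for the manuscript: the function names are hypothetical, and it assumes that the domain_conc and FP values in every row parse as numbers.

```python
import csv
from collections import defaultdict

# Columns that together identify one domain-peptide measurement (per the note above).
KEY_COLUMNS = ("filename", "plate_number", "peptide_number", "domain_name", "domain_pos")


def group_wells(rows):
    """Group well-level rows (dicts) into measurements keyed by KEY_COLUMNS."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        groups[key].append((float(row["domain_conc"]), float(row["FP"])))
    # Sort each titration by concentration so the points are in a consistent order.
    return {key: sorted(points) for key, points in groups.items()}


def load_measurements(path="01_rawdata_loaded_from_expt_format.csv"):
    """Read the flat-table CSV and group it into measurements."""
    with open(path, newline="") as handle:
        return group_wells(csv.DictReader(handle))


def incomplete_measurements(measurements, expected_points=12):
    """Return the keys of measurements without the expected number of wells."""
    return [key for key, points in measurements.items() if len(points) != expected_points]
```

Running incomplete_measurements(load_measurements()) gives a quick check that each grouped measurement has the 12 concentration points described above.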
The file '02_fitted_Kd_and_model_selection_data.csv' contains the 12 concentration points of each affinity measurement compressed into one row. Each row then represents one full affinity measurement, with its 12 concentrations and 12 FP values. In addition, each row contains the results from all models fitted, plus the model selected. The data file contains the following columns:
domain: protein domain
gene_name: peptide gene source
pY_pos: phosphotyrosine position in the original source of the peptide (which, when combined with the gene name, gives
the common phosphopeptide descriptor, e.g. EGFR pY 1172)
peptide_number: the id number of the peptide from the original publication master peptide list (found at
'/origdata/Jones FP master peptide list edited.csv')
peptide_description: A text description of the peptide from the original publication raw data.
plate_number: The 384-well plate identification number from the original publication raw data
filename: The filename from the original publication raw data for this row
domain_pos: the position on the plate of the domain (relevant in the cases where the same domain was tested more than
once on the plate)
bg: a background value common to the entire plate
pepConcTxt: a text field concatenating the peptide concentration list
FluorTxt: a text field concatenating the FP value list for each of the concentrations
PeptideConc: a non-text field concatenating the peptide concentration list (for use in the analysis script)
Fluorescence: a non-text field concatenating the FP value list for each of the concentrations (for use in the analysis script)
linear_slope: the fitted slope of the linear model fit
linear_offset: the fitted offset (in FP units) of the linear model fit
linear_rsq: the r-squared calculation for the linear fit
linear_AICc: the AICc score for the linear fit
linear_resVar: variance of the residuals for the linear fit
linear_success: a boolean indicator of linear fit success
fo_Fmax: the fitted Fmax (saturation) for the first-order model fit
fo_Kd: the fitted affinity (Kd, in uM) of the first-order model fit
fo_offset: the fitted offset (in FP units) of the first-order model fit
fo_rsq: the r-squared calculation for the first-order fit
fo_AICc: the AICc score for the first-order fit
fo_resVar: variance of the residuals for the first-order fit
fo_success: a boolean indicator of first-order fit success
peptide_str: a text string representing the peptide sequence
pep_seq_cleaned: an alternate version of the text string representing the peptide sequence, without special characters
pep_seq_10mer_aligned: an aligned (to the pY) 10-mer representation of the peptide string for easy peptide comparison
domain_seq_align_gapless: the SH2 protein domain sequence from an internally generated alignment (gaps removed)
simJones_Kd: internal-use field for comparison to the original publication
simJones_Fmax: internal-use field for comparison to the original publication
simJones_offset: internal-use field for comparison to the original publication
simJones_rsq: internal-use field for comparison to the original publication
simJones_binding_call: internal-use field for comparison to the original publication
fit_method: the selected model fit (linear or first-order) based on the lowest AICc score
fit_slope: the fitted slope of the selected fit (if available)
fit_Kd: the fitted affinity (Kd) of the selected fit (if available)
fit_Fmax: the fitted Fmax (saturation) for the selected fit (if available)
fit_offset: the fitted offset (in FP units) of the selected fit
fit_rsq: the r-squared calculation for the selected fit
fit_AICc: the AICc score for the selected fit
fit_resVar: variance of the residuals for the selected fit
fit_success: a boolean indicator of the selected fit success
fit_residuals: a text field concatenating the residuals at each point of the selected fit
fit_modelDNR: the dynamic range of the selected fit (Fmax-offset)
fit_sumres: the sum of the residuals for the selected fit
fit_snr: the fit signal-to-noise ratio, calculated as fit_sumres/fit_modelDNR
fit_dnr: the dynamic range of the FP data (max FP-min FP)
binding_call: the final categorization of the fit after model selection and additional evaluation
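The selection and quality columns above follow simple rules that can be written out directly. The sketch below is only an illustration based on the column descriptions, not the code in 02_Kd_curve_fitting_and_model_selection.py. The function names are hypothetical, skipping unsuccessful fits is an assumption, and fit_sumres is taken as given because the description does not say whether the residuals are summed as signed, absolute, or squared values.

```python
def select_model(linear_aicc, fo_aicc, linear_success=True, fo_success=True):
    """Pick the model with the lowest AICc among the fits that succeeded."""
    candidates = {}
    if linear_success:
        candidates["linear"] = linear_aicc
    if fo_success:
        candidates["first-order"] = fo_aicc
    if not candidates:
        return None  # assumption: no model is selected when both fits fail
    # The returned labels are illustrative; the strings stored in fit_method may differ.
    return min(candidates, key=candidates.get)


def model_dynamic_range(fmax, offset):
    """fit_modelDNR: dynamic range of the selected fit (Fmax - offset)."""
    return fmax - offset


def data_dynamic_range(fp_values):
    """fit_dnr: dynamic range of the measured FP data (max FP - min FP)."""
    return max(fp_values) - min(fp_values)


def fit_snr(fit_sumres, fit_model_dnr):
    """fit_snr as described above: fit_sumres / fit_modelDNR."""
    return fit_sumres / fit_model_dnr
```

For example, select_model(10.0, 5.0) chooses the first-order model because its AICc is lower, while select_model(10.0, 5.0, fo_success=False) falls back to the linear model.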
The file '03a_fitted_Kd_and_model_selection_data_with_functionality_evaluated.csv' contains the same data as '02_fitted_Kd_and_model_selection_data.csv', with the following additional fields relating to functional protein identification:
binder_in_rep_grp: indicates whether there was any binder in the replicate group for a domain-peptide
protein_non_functional: indicates whether the domain protein tested in this measurement is likely to be non-functional
binding_call_functionality_evaluated: indicates the revised call for this measurement based on the protein functionality assessment
The file '03b_grouped_data_including_published.csv' contains the final affinity results of our revised analysis after replicate group evaluation, as well as the original published affinity. Fields are as above, plus:
count_replicates: total count of all replicate measurements for a given domain-peptide pair
count_bind: count of replicates categorized as binders
count_nobind: count of replicates categorized as non-binders
count_agg: count of replicates categorized as aggregators
count_lowSNR: count of replicates categorized as low-SNR interactions
count_nonfunctional: count of replicates for which categorization changed to non-functional after protein functionality evaluation
pub_Kd_mean: the published mean affinity (Kd, in uM)
pub_num_replicates: the number of replicates indicated in the original publication (represents the number before certain
filtering processes were performed in the original analysis; use with caution)
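The count_* fields above are per-category tallies over the replicates of one domain-peptide pair. The sketch below shows that tallying in its simplest form. It is not the code in 03_replicate_and_functionality_analysis.py: the function name is hypothetical, and so are the category labels, since the actual strings stored in the call columns are not listed in this readme.

```python
from collections import Counter

# Hypothetical category labels; the actual strings used in the CSV files may differ.
COUNT_COLUMNS = {
    "binder": "count_bind",
    "nonbinder": "count_nobind",
    "aggregator": "count_agg",
    "low-SNR": "count_lowSNR",
    "nonfunctional": "count_nonfunctional",
}


def summarize_replicates(calls):
    """Tally the per-replicate calls of one domain-peptide pair into count_* fields."""
    tally = Counter(calls)
    summary = {"count_replicates": len(calls)}
    for label, column in COUNT_COLUMNS.items():
        summary[column] = tally.get(label, 0)
    return summary
```

Given the calls of a replicate group as a list, summarize_replicates returns a dictionary with one entry per count field, including zero for categories that do not occur.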