-
Notifications
You must be signed in to change notification settings - Fork 0
/
advanced.qmd
220 lines (193 loc) · 12.1 KB
/
advanced.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
---
title: "Advanced homology modeling"
date: "August 21, 2023"
date-modified: "`r Sys.Date()`"
format:
html:
page-layout: full
toc: true
toc-location: right
toc-depth: 3
number-sections: true
number-depth: 1
link-external-icon: true
link-external-newwindow: true
bibliography: references.bib
editor:
markdown:
wrap: 80
---
```{r echo=FALSE, output=FALSE}
library(webexercises)
```
# From homology modeling to threading
Although we do not intend to describe in detail the evolution of modeling
methods, I briefly outline below the origin and transformation of advanced
protocols that outperform the classical single-template homology modeling during
the last three decades. This step-wise evolution of modeling methods is the
origin of the revolution of Alphafold and related protocols, which we will
discuss in the next [section](ai.html).
## Threading or Fold-recognition methods {#sec-threading}
As mentioned earlier, the introduction of HMM-based profiles during the first
decade of this century led to a great improvement in template detection and
protein modeling in the twilight zone, i.e., proteins with only distant homologs
(\<25-30% identity) in databases. In order to exploit the power of HMM searches,
those methods naturally evolved into iterative [threading]{.underline} methods,
based on multitemplate model construction, implemented in
[I-TASSER](https://zhanggroup.org/I-TASSER/) [@roy2010],
[Phyre2](http://www.sbg.bio.ic.ac.uk/phyre2/) [@kelley2015], and
[RosettaCM](https://new.rosettacommons.org/demos/latest/tutorials/rosetta_cm/rosetta_cm_tutorial)
[@song2013], among others. These methods are usually referred to as
**Threading** or **Fold-recognition** methods. Note that the classification of
modeling methods is often blurry. The current version of SwissModel and the use
of HHPred+Modeller already rely on HMM profiles for template identification and
alignment; being thus strictly also fold-recognition methods.
Both terms can be often used interchangeably, although some authors see
**Fold-Recognition** as any technique that uses structural information in
addition to sequence information to identify remote homologies, while
**Threading** would refer to a more complex process of modeling including remote
homologies and also the modeling of pairwise amino acid interactions in the
structure. Therefore, HHPRED is a fold-recognition method and its use along with
Modeller, could be indeed considered threading.
![The idea behind fold-recognition is that instead of comparing sequences, we
intend to compare structures. In the Frozen approximation (left), one residue is
aligned with the template structure and then we evaluate the probability of the
nearby residues in the query sequence to be in the same position than the
equivalent in the template. On the other hand, Defrost methods use profiles to
generate improved alignments that allow better starting points to the energy
calculations during the iterative modeling steps. From
@kelley2009a.](pics/fr.png){#fig-frost .figure}
The Iterative Threading ASSembly Refinement
([I-TASSER](http://zhanglab.ccmb.med.umich.edu/I-TASSER)) from [Yang Zhang
lab](https://zhanggroup.org/) is one of the most widely used threading methods
and servers. This method was was ranked as the No 1 server for protein structure
prediction in the community-wide [CASP7](https://zhanggroup.org/casp7/21.html),
[CASP8](https://predictioncenter.org/casp8/groups_analysis.cgi?target_type=0&gr_type=server),
[CASP9](https://predictioncenter.org/casp9/CD/data/html/groups.2.html),
[CASP10](http://predictioncenter.org/casp10/groups_analysis.cgi?type=server),
[CASP11](http://www.predictioncenter.org/casp11/zscores_final.cgi?gr_type=server_only),
[CASP12](http://www.predictioncenter.org/casp12/zscores_final.cgi?&gr_type=server_only),
[CASP13](http://www.predictioncenter.org/casp13/zscores_final.cgi?model_type=best&gr_type=server_only),
and
[CASP14](http://www.predictioncenter.org/casp14/zscores_final.cgi?gr_type=server_only)
experiments. I-TASSER first generates three-dimensional (3D) atomic models from
multiple threading alignments and iterative structural assembly simulations that
are iteratively selected and improved. The quality of the template alignments
(and therefore the difficulty of modeling the targets) is judged based on the
statistical significance of the best threading alignment, i.e., the *Z*-score,
which is defined as the energy score in standard deviation units relative to the
statistical mean of all alignments.
![Flowchart of I-TASSER protein structure modeling. From
@rigden2017a.](images/paste-29011FF5.png){#fig-tasser .figure}
First, I-TASSER uses Psi-BLAST against curated databases to select sequence
homologs and generate a sequence profile. That profile is used to predict the
secondary structure and generate multiple fragmented models using several
programs. The top template hits from each threading program are then selected
for the following steps. In the second stage, continuous fragments in threading
alignments are excised from the template structures and are used to assemble
structural conformations of the sections that aligned well, with the unaligned
regions (mainly loops/tails) built by *ab initio* modeling. The fragment
assembly is performed using a modified replica-exchange Monte Carlo random
simulation technique, which implements several replica simulations in parallel
using different conditions that are periodically exchanged. Those simulations
consider multiple parameters, including model statistics (stereochemical
outliers, H-bond, hydrophobicity...), spatial restraints and amino acid pairwise
contact predictions ([see below](#maps)). In each step, output models are
clustered to select the representative ones for the next stage. A final
refinement step includes rotamers modeling and filtering out steric clashes.
One interesting thing about I-TASSER is that it is integrated within a server
with many other applications, including some of the tools that I-TASSER uses and
other advanced methods based on I-TASSER, like I-TASSER-MTD for large,
multidomain proteins or C-I-TASSER that implements a deep learning step, similar
to Alphafold2 (see [next section](ai.html)).
![RosettaCM Protocol. (A) Flowchart of the RosettaCM protocol. (B--D) RosettaCM
conformational sampling. From
@song2013.](pics/rosettaCM.jpg "RosettaCM Protocol"){#fig-rosettacm .figure
width="664"}
**RosettaCM** is an advanced homology modeling or threading algorithm by the
[Baker lab](https://www.bakerlab.org/), implemented in
[Rosetta](https://www.rosettacommons.org/software) software and the
[Robetta](https://robetta.bakerlab.org/) webserver. RossetaCM provides accurate
models by breaking up the sequence into fragments that are aligned to a set of
selected templates, generating accurate models by a threading processes that
uses different fragments from each of the templates. Additionally it uses minor
*ab initio* folding to fill the residues that could not be assigned during the
threading. Then, the model is closed by iterative optimization steps that
include Monte Carlo sampling. Finally, an *all-atom* refinement towards a
minimum of free energy [@song2013].
::: callout-important
### Puzzling nomenclature: comparative, homology or *ab initio* modeling?
*De novo* or *ab initio* modeling used to mean modeling a protein
[without]{.underline} using a template. However, this strict definition is
blurred in the 2000s (decade) by advanced methods that use fragments. Threading
protocols such as *RosettaCM* and *I-Tasser*, among others, use fragments that
may or may not come from homologous protein structures or not. Therefore, they
cannot be classified as *homology modeling,* but they are sometimes referred to
as *comparative* or hybrid methods.
:::
# From contact maps to pairwise high-res feature maps {#maps}
A **protein contact map** represents the contacts between all possible amino
acid residue pairs of a given protein sequence in the three-dimensional
structure. It is represented as a binary (0,1) matrix with n rows and columns,
being n the number of residues of the sequence. Thus, for two residues, *i* and
*j*, the *ij* element of the matrix is 1 if the two residues are in contact in
the structure. In practical terms, the contact is defined as when the residues
are closer than a specified threshold. In the examples on [Figure
8](#fig-contact), the threshold is 9 Å. Note that the pattern in the map clearly
show differences between motifs and somehow reflex secondary structure
stretches.
![Contact-based map of representative proteins. The map represents a matrix of
amino acid positions in the protein sequences (on both, the X and Y axis); with
contacts indicated as blue dots. When a number of consecutive residues in the
sequence interact the dots form diagonal stretches. Maps obtained at
<http://cmweb.enzim.hu/>](pics/contact.png "Contact-based map"){#fig-contact
.figure}
An accurate information of protein's residue--residue contacts is sufficient to
elucidate the fold of a protein [@olmea1997]; however implementation of these
maps in protein modeling is not trivial, as predicting that map is not always
easy. The introduction of **direct-coupling analysis (DCA)**, i.e., extract the
residue coevolution from MSAs ([Figure 9](#fig-coevol)) improved contact maps
and allowed their implementation for protein folding in several methods, like
PSICOV [@jones2012] or Gremlin [@kamisetty2013], among others. However, it
should be noted for proteins without many sequence homologs, the predicted
contacts were of low quality and insufficient for accurate contact-assisted
protein modeling.
![Schematic of how co-evolution methods extract information about protein
structure from a multiple sequence alignment (MSA). From
@bittrich2019](pics/contact_evol.png){#fig-coevol .figure}
### Implementation of several layers of information processed by neural network and deep learning methods {#sec-NN .section}
Deep learning is a sub-field of machine learning which is based on artificial
neural networks (NN). Neural networks were introduced actually in the late 40's
and 50's, but they reappeared in the 2000's thanks to the increase in
computational capacities and, more recently, the use of
[GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit). Briefly, a NN
uses multiple interconnected layers to transform multiple inputs (MSAs,
high-resolution contact-based maps...) into compound features that can be used
to predict a complex output, like a 3D protein structure. As their name
indicates, NNs attempt to simulate the behavior of the human brain that
processes large amounts of data and can be trained to "learn" from that data.
Deep learning is based on the use of multiple layer-NN to optimize and refine
for accuracy.
![Illustration of column pair and precision sub-matrix grouping for advanced
prediction of contact maps. In the example, Columns 5 and 14 in the first family
are aligned to columns 5 and 11 in the second family, respectively, so column
pair (5,14) in the first family and the pair (5,11) in the second family are
assigned to the same group. Accordingly, the two precision sub-matrices will be
assigned to the same group. From @ma2015.](pics/contact2.png){#fig-families
.figure}
The next level of complexity in contact maps is their application to distantly
related proteins by comparing sets of DCA from different protein families, a
method sometimes referred as joint evolutionary coupling analysis ([Figure
10](#fig-families)). This kind of analysis entails processing a huge amount of
information, which increases the computational resources requirements. Thus, the
use of trained neural networks and state-of-the-art deep-learning methods
boosted the capacities of protein modeling.
In this context, the introduction of supervised machine learning methods that
predict contacts, outperforms DCA methods by the use of multilayer neural
networks [@jones2015; @ma2015; @wang2017a; @yang2020]. These methods implemented
the use of the so-called high **resolution contact maps** ([Figure
11](#fig-highres)), which contains enriched information with not only contacts
but also distances, and angles, represented in a heatmap-like probability scale.
![Example of high-resolution contact maps of 6MSP. From
@yang2020](pics/high_res_maps.png "High-resolution contact maps"){#fig-highres
.figure}