homology.qmd

---
title: "Homology Modeling"
date: "August 21, 2023"
date-modified: "`r Sys.Date()`"
format: 
  html:
    page-layout: full
toc: true
toc-location: right
toc-depth: 2
number-sections: true
number-depth: 1
link-external-icon: true
link-external-newwindow: true 
bibliography: references.bib
editor: 
  markdown: 
    wrap: 80
---

```{r echo=FALSE, output=FALSE}

library(webexercises)
```

# Modeling protein structures: Mind the gap

The number of protein sequences in databases growth exponentially during the
last decades, particularly after the revolution of high throughput sequencing
methods.

[![Number of entries in
UniProt/TrEMBL.](https://www.ebi.ac.uk/inc/drupal/uniprot/statistics/relstat1.png "Number of entries in UniProt/TrEMBL"){#fig-seqs
.figure}](https://www.ebi.ac.uk/uniprot/TrEMBLstats)

However, experimental determination of 3D protein structures is often difficult,
time-consuming, and subjected to limitations, such as experimental error, data
interpretation and modeling new data on previously released structures. Thus,
despite substantial efforts that started at the beginning of the 21st century to
implement high-throughput structural biology methods [see for instance
@manjasetty2008], the availability of protein structures is more than 1,000
times less than the number of sequences (\>250M sequences for Uniprot and ∽205k
structures in RCSB Protein Databank in September 2023). This difference is
called the ***protein sequence-structure gap*** and it is constantly widening
[@muhammed2019], especially if we consider the sequences from metagenomics
databases not available in Uniprot.

[![Number of macromolecular structures in RCSB PDB database (accessed 19th
September
2023).](pics/RCSB_stats.png "Number of macromolecular structures in RCSB PDB database."){#fig-structures
.figure}](https://www.rcsb.org/stats/growth/growth-released-structures)

Thus, an accurate prediction of the 3D structure of any given protein is needed
to make up for the lack of experimental data.

## Are protein structures predictable at all?

The properties of the amino acids determine the Φ and Ψ angles that eventually
shape the higher structural levels. However, protein folding can be more
complex, as it should be coupled with protein synthesis.

One can imagine that the complexity and diversity of protein structures in
nature can be enormous. Indeed, John Kendrew and his co-workers seemed very
disappointed in the determination of the first three-dimensional globular
structure, the myoglobin in 1958 [@kendrew1958]:

> *Perhaps the most remarkable features of the molecule are its complexity and
> its lack of symmetry. The arrangement seems to be almost totally lacking in
> the kind of regularities which one instinctively anticipates, and it is more
> complicated that has been anticipated by any theory of protein structure.*

Not much later, in 1968, Cyrus Levinthal (1922--1990) published the so-called
*Levinthal's paradox*, stating that proteins fold in nano/milliseconds, but even
for small peptides it will take a huge time to test the astronomical number of
possible conformations. Say a 100 aa small protein; it will have 99 peptidic
bonds and 198 different phi and psi angles. Assuming only 3 alternative
conformations for each bond, it will yield 3^198^ (= 2.95 x 10^95^) possible
conformations. If we design a highly efficient algorithm that tests 1
conformation per nanosecond:

|  2.95 x 10^85^ secs = **9x10^67^ billions years**

Considering that the age of the universe is 13.8 billion years, predicting
protein structures does not seem an easy task.

In this context, a very simple experiment 50 years ago, led some light on the
protein folding mechanism. [Cristian
Anfisen](https://en.wikipedia.org/wiki/Christian_B._Anfinsen) was able to
completely denature (unfold) the Ribonuclease A, by the addition of reducing
agents and urea under heat treatment, and subsequently switch to normal
conditions that allow the protein to re-fold fully functional. This experiment
indicates that the amino acid sequence dictates the final structure.
Notwithstanding some relevant exceptions, this has been largely confirmed.

![The Anfinsen Dogma: Amino acid sequence dictated the final structure. From
@anfinsen1973 .](pics/anfinsen.jpg "Anfinsen"){#fig-anfinsen .figure}

One can imagine that *in vivo* native structures of proteins look-alike the
lowest free energy conformation, i.e., the global energy minimum. That is the
basis of the funnel model of protein folding, which assumes that the number of
possible conformations is reduced when a local energy minimum is achieved,
constituting a path for the folding process.

[![Schematic diagram of a protein folding energy landscape according to the
funnel model. Denatured molecules at the top of the funnel might fold to the
native state by a myriad of different routes, some of which involve transient
intermediates (local energy minima) whereas others involve significant kinetic
traps (misfolded states). From
@radford2000.](pics/funnel_1.jpeg "The funnel of protein folding"){#fig-funnel
.figure}](https://www.sciencedirect.com/science/article/pii/S0968000400017072?casa_token=BR_maX_GNYYAAAAA:IZrQqI9jIbiv-ZeA6aOMHQxsVcp-wgy0XPFO3DRcFiAi7TSrX-3cc7Jb6dhTHdXSQseEhF3l#BIB49)

::: callout-important
### Conclusion

Prediction of protein structures is possible, as protein folding relies only on
the protein sequence, but it will require virtually infinite time and
computational resources ...or highly efficient ones, as we will see.
:::

**Homology modeling** is one of the most convenient tricks to get around this
limitation. Basically, the strategy is to add more layers of additional
information to amino acid properties, namely evolutionary conservation of
sequences and structures.

Very often, you already have some information about your protein before you
create models. For example, if it is an enzyme, you may have discovered the
catalytic residues or the substrate interaction region. It is also advisable to
look in the literature, in particular to see if there is a companion paper to
the corresponding PDB structure(s) that may be available or that you may be able
to find and use as a template(s) for modeling.

# Homology modeling: Conservation is the key

![Protein modeling in a nutshell is about going from the sequence to the
structure. But, how do we do
it?](pics/modeling.png "Protein modeling"){#fig-nutshell .figure}

Traditionally and conceptually, protein structure prediction can be approached
from two different perspectives: *comparative modeling* and *ab initio*
prediction. In the comparative modeling approach, one can predict the 3D
structure of protein sequences only if their homologs are found in the database
of proteins with known structures. Of course, identification of such homologs is
the key here. Until recently, most of the development in protein modeling has
been driven by the development of methods to identify distant sequence
similarities that would reflect similar protein folds. On the other hand, *ab
initio* modeling is only based on the physicochemical properties of the
molecule.

[![Two different approaches for structure
prediction.](pics/predway.jpeg "Two different approaches for structure prediction."){#fig-modelingways
.figure}](http://isw3.naist.jp/IS/Bio-Info-Unit/gogroup/study/study-en.html)

The accuracy of homology modeling is limited by the availability of similar
structures. *Ab initio* predictions are limited by mathematical models and
computational resources and are often useful only for small peptides.

To maintain structure and function, certain amino acids in the protein sequence
are subject to greater selection pressure. They either evolve more slowly than
expected or within certain constraints, such as chemical similarity (i.e.,
conservative substitutions). Therefore, homology modeling approaches assume that
a similar protein sequence involves similar 3D structure and function. Although
the number of published sequences and structures is constantly increasing, the
number of unique folds remained nearly constant since 2008 until very recently.

![Growth of the protein structure database since its inception in 1974. From
[Bonvin lab
site](https://www.bonvinlab.org/education/molmod_online/modelling/).](pics/rcsb-statistics.png "Protein structures and unique folds"){#fig-bovin
.figure}

This means that **the space of protein sequences is much larger than the space
of structures**. This has been exploited by some structural databases such as
Scop2 or CATH, which use hierarchical classifications of structures into very
few categories of distinct structures.

![[CATH](https://www.cathdb.info/) main categories exemplified the limited
structures space, with 86k vs 500k domains in 2006 and 2022, respectively, but
40 and 41 architectures in the same years.](pics/cath.png "CATH"){#fig-cath
.figure width="632"}

In summary, protein structures are more conserved than sequences, which allows
the construction of models by comparing proteins with different sequences.
Biological sequences evolve by mutation and selection, and selection pressure
varies for each residue position in a protein depending on its structural and
functional importance. Sequence alignments attempt to tell us the evolutionary
story of proteins.

# Homology modeling in four steps

![Workflow of template-based protein modeling. From Expasy [Protein Structure,
Comparative Protein Modelling, and Visualisation online
course](https://swissmodel.expasy.org/static/course/files/PartII_homology_modelling.pdf).](pics/homology_simple_workflow.png "Homology Modeling workflow"){#fig-modelingsteps
.figure width="397"}

A basic workflow of homology modeling requires three steps: (1) identification
of the most appropriate template, (2) alignment of the query sequence and
template, and (3) construction of the model. These steps can be approached with
different methodological alternatives and may yield different results that need
to be evaluated (step 4) to find the best solution for each step. For a more
detailed overview of homology modeling, I recommend reading @haddad2020.

In this course, we will focus on end-to-end modeling using
[SWISS-MODEL](https://swissmodel.expasy.org/) [@waterhouse2018], a fully
automated modeling server that allows you to construct models without having to
have a strong bioinformatics background or programming skills. Early versions of
SWISS-MODEL only allowed modeling of sequences with homologs in databases, but
as discussed below, implementation of advances in template recognition and model
building has increased modeling capability and accuracy, especially in the last
15 years.

SWISS- MODEL has several supported inputs. Typically, you enter a single
sequence query for template search, but you can also bypass this step and
directly enter the desired template or even a template and a custom alignment.
We will start with the first alternative.

#### [In-class Homology Modeling exercise: Quick Modeling with SWISS-MODEL.]{style="color:green"}

As an exercise, we are modeling a human DNA repair protein, NEIL2, and a viral
DNA polymerase (HAdV-2 pol). We will intersperse discussion about modeling with
the exercise steps.

::: callout-note
#### Rather than provide just a recipe to make a model in two clics, the goal is to understand the process behind, while doing exercises with selected examples
:::

## Step 1+2. Template search and align. {#template}

### Where can we search?

The template searching consists in finding a protein with known structure(s)
with a sequence related to our protein. As we mentioned earlier, the [RCSB
Protein Data Bank](https://www.rcsb.org/) (PDB) is the largest database of
protein structures. Thus, we can search for templates by comparing the sequence
of our protein with the sequence of all proteins in the PDB. However, the PDB
was created as a repository to contain all the macromolecular structures, not to
search templates for modeling. Similar to other end-to-end software, SWISS-MODEL
has its own curated database, the **SMLT** (SWISS-MODEL template library). This
is based on profiles alignments from the PDB, is updated weekly, and is also annotated and indexed to
facilitating searching. As of September 13, 2023, SMLT contains 145,813 unique
protein sequences that can map into 375,008 [biological
units](https://proteopedia.org/wiki/index.php/Biological_Unit). Since 2023,
SWISS-MODEL also searches in the AlphaFold DB.

Now, you first need your sequences. A protein sequence database like
[Uniprot](https://www.uniprot.org/) or [NCBI
Protein](https://www.ncbi.nlm.nih.gov/protein/), is a good place for this task.

If you want to cheat yourself, just follow the links to
[HAdV-2pol](https://rest.uniprot.org/uniprotkb/P03261.fasta) and
[NEIL2](https://rest.uniprot.org/uniprotkb/Q969S2.fasta) sequences.

### How can we search accurately and fast?

![Query-template alignment is the base for homology modeling
[@kelley2009]](pics/template_query.png "Query-template alignment is the base for a new model"){#fig-query
.figure width="631"}

To find templates, sequences must be compared, so an accurate and powerful
alignment method is essential. Comparing a protein sequence to an entire
database is time consuming because you are comparing to completely unrelated
proteins, which means a loss of resources. Two fundamental improvements have
increased the capacity of template search: (1) the introduction of secondary
structure (SS) by comparing SS predictions of the query protein and secondary
structures of the protein database, and (2) the use of profiles to facilitate
the comparison.
[Profiles](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/what-are-protein-signatures/signature-types/what-are-profiles/)
are a mathematical method of summarizing a multiple sequence alignment that
quantifies the probability of each amino acid at each position. A particular
type of profile [*Hidden Markov
Models*](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/what-are-protein-signatures/signature-types/what-are-hmms/)
(HMM) are very useful for searching databases for similar sequences. Transition
probabilities (i.e., the probability that a particular amino acid follows
another particular amino acid) are also modeled. In addition, HMMs include
insertions and deletions of amino acids. These features allow HMMs to model
entire alignments in great detail, including very divergent regions, and
facilitate the identification of highly conserved positions that define not only
protein function but also folding. For example, glycine residues at the end of
each beta strand or a pattern of polar residues that favor alpha helices. Prior
comparison of the query sequence to a database of sequences allows us to
incorporate evolutionary information about the sequence. Therefore, we started
from a requirement of \>30% identity to obtain good models before the
implementation of profiles, to good models even with ⁓20% identity or below.
Moreover, the generation of profiles also facilitates clustering of the search
database, reducing search time. The implementation of these capacities led to
the implementation of the so-called **fold recognition** in homology modeling.

![From sequence vs. sequence search to profile-profile comparison
[@kelley2009].](pics/profiles.jpg "From sequence vs. sequence search to profile-profile comparison"){#fig-profiles
.figure width="669"}

Template searching in SWISS-MODEL has evolved over the years toward more
accurate results. Currently, this is done with **HHblits** [@remmert2011], a
iterative profile-profile method. We can also search for templates using Blast
or other profile-profile methods, like
[Psi-BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROGRAM=blastp&BLAST_PROGRAMS=psiBlast),
[HHPred](https://toolkit.tuebingen.mpg.de/tools/hhpred), or
[JackHMMER](https://www.ebi.ac.uk/Tools/hmmer/search/jackhmmer). Then, the
templates are ranked between 0 and 1 using two different numerical parameters:
**GMQE** (global model quality estimate) and QSQE (quaternary structure quality
estimate). Briefly, GMQE uses likelihood functions to evaluate various
properties of the target-template alignment (sequence identity, sequence
similarity, HHblits score, the agreement between the predicted secondary
structure of target and template, the agreement between the predicted solvent
accessibility between target and template; all normalized by the alignment
length) to predict the expected quality of the resulting model. QSQE evaluates
the likelihood of the oligomeric state of the model.

[![SWISS-MODEL modeling start. Just paste your sequences and click on
Search.](pics/seach.png "SWISS-MODEL modeling"){#fig-swissmodel
.figure}](https://swissmodel.expasy.org/interactive)

::: callout-tip
If you click on "Build Model", it will directly use the top-ranked template, so you'll miss some fun, but you can go back for that later on using the "Templates" tab.
:::

#### [Could you foresee which of our queries will give rise to a better model? Why?]{style="color:green"}

## Step 3. Model Building.

By default, SWISS-MODEL will provide 50 ranked possible templates. The output
also contains information about the method and resolution of the templates, the
% of identity (and alignment coverage) with the query sequence, and the GMQE and
QSQE.

The top template is marked by default and it likely will give the best model,
but it is also interesting to try some alternative templates depending on the
downstream application of the model (see [below](#corollary)). For instance with
a different substrate/cofactor that can have a key role in the protein function
or with different coverage or % identity.

Once the template(s) is selected, model coordinates are constructed based on the
alignment of the query and template sequence using ProMod3 module [@studer2021].
SWISS-MODEL uses a fragment assembly, which is also the bases of
*Fold-recognition* or *Threading* methods (see Threading
[section](advanced.html#sec-threading)). Other programs, like
[Modeller](https://salilab.org/modeller/), are based in the satisfaction of
general spatial restraints [@janson2019]. Modeller is a command-line tool that
allow full customization of the modeling, which requires more knowledge about
the process but can be very useful for some types of proteins (see [@webb2017]).
However, it has been implemented in some online servers
([ModWeb](https://modbase.compbio.ucsf.edu/modweb/)) and user-friendly
applications, including ChimeraX and Pymol (Pymod plugin, @janson2021). Modeller
can be also called from the HHPred output (if you included PDB as a searchable
database), which is very convenient to model remote homologs using several
templates in a few minutes.

Fragment assembly will use the template core backbone atoms to build a core
structure of the model, leaving non-conserved regions (mostly loops) for later.
**Loops modeling** includes the use of a homologs subset of a dedicated loop
database, [Monte Carlo](https://en.wikipedia.org/wiki/Monte_Carlo_method)
sampling as a fallback and even *ab initio* building or missing loops.

![Backbone and loop modeling. From [Expasy Protein Structure, Comparative
Protein Modelling, and
Visualisation](https://swissmodel.expasy.org/static/course/files/PartII_homology_modelling.pdf).](pics/loops.png "Backbone and loop modeling"){#fig-backbone
.figure}

Then, positioning of **side chain** of non conserved amino acids is undertook.
The goal is finding the most likely side chain conformation, using template
structure information, rotamer libraries (from a curated set known protein
structures) and energetic and packaging criteria. If many side chains have to be
placed in the structure it will lead to a "chicken and egg problem", as
positioning one rotamer would affect others. That means that identification of
possible hydrogen bonds between residues side chains and between side chains and
the backbone reduce the optimization calculations. At the end of the day, the
more residues correctly positioned, the best model.

![Side chain modeling. From [CMBI Seminars on
Bioinformatics](https://swift.cmbi.umcn.nl/teach/B1SEM/B1SEM_8.html).](pics/Rotamers.png "Backbone and loop modeling"){#fig-rotamers
.figure}

Finally, a short energy minimization is carried out to reduce the unfavorable
contacts and bonds by adapting the angle geometries and relax close contacts.
This energy minimization step or refinement can be useful to achieve better
models but only when the folding is already accurate.

## Step 4. Result assessing.

The computer always give you a model but it doesn't mean that you have a model
that makes sense. How can we know if we can rely on the model? Output models are
colored in a temperature color scale, from navy blue (good quality) to red (bad
quality). That can help us to understand our model in a first sight. Also, this
is an interactive site and you can zoom-in, zoom-out the model. Many other
features are available to work on your model. For instance, you can compare
multiple models, you can change the display options. You can also download all
the files and reports on the "Project Data" button.

[![NEIL2 model created with SWISS-MODEL (July
2022)](pics/neil2.png "NEIL2 model"){#fig-neil
.figure}](https://swissmodel.expasy.org/interactive/TWq8LD/models/)

There is also a "Structure Assessment" option. This provide you a detailed
report of the structural problems of your model. You can see [Ramachandran
plots](intro.html#sec-rama "Ramachandran plots in Wikipedia") that highlight in
red the amino acid residues with abnormal phi/psi angles in the model and a
detailed list of other problems.

The GMQE is updated to the QMEAN Zscore and QMEANDisCo [@studer2020]. The QMEAN
Z-score or the normalized QMEAN score indicates how the model is comparable to
experimental structures of similar size. A QMEAN Z-score around 0 indicates good
agreement, while score below -4.0 are given to models of low quality. Besides
the number, a plot shows the QMEAN score of our model (red star) within all
QMEAN scores of experimentally determined structures compared to their size.
Overall, the Z-score is equivalent to the standard deviation of the mean.

![Per-residue QMEANDisCo scores are mapped as red-to-green colour gradient on a
model of lbp-8 in Caenorhabditis elegans (UniProtKB: O02324, PDB: 6C1Z).
Distance constraints have been constructed from an ensemble of experimentally
determined protein structures that are homologous to lbp-8. The inset depicts
two example constraints between residues marked with colour-coded spheres in the
model. From @studer2020.](images/paste-C2BED0C6.png){#fig-disco .figure
width="572"}

The QMEANDisCO was implemented in SWISS-MODEL in 2020 and it is a powerful,
single parameter that combines statistical potentials and agreement terms with a
distance constraints (DisCo) to provide a consensus score. DisCo evaluates
consistencies of pairwise CA-CA distances from a model with constraints
extracted from homologous structures. All scores are combined using a neural
network trained to predict per-residue scores. We can check a global score, but
also a local score for each residue, that help us to understand which regions of
the model are more likely to accurately folded (i.e. they are more reliable).

[![HAdV-2 DNA polymerase model obtained with SWISS-MODEL (July
2022).](pics/hadv-pol.png "HAdV-2 DNA polymerase"){#fig-pol
.figure}](https://swissmodel.expasy.org/interactive/GZ8GmU/models/)

QMEANDisCo can be used to analyze models obtained with other methods in order to
make them comparable (note that you can use QMEANDisCo for models obtained with
any method, you just need a `.pdb` file). There are other independent model
assessing tools commonly used to assess protein models, like
[VoroMQA](https://bioinformatics.lt/wtsam/voromqa) [@olechnovi2017] or
[MoldFold](https://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD8_form.html)
[@mcguffin2021]. VoroMQA is very quick method that combines the idea of
statistical potentials (i.e. a knowledge-based score function) with the use of
interatomic contact areas to provide a score in the range of \[0,1\]. When
applied to PDB database, most of the high-quality experimentally-based
structures have a VoroMQA score \>0.4. Thus, if the score is greater than 0.4,
then the model is likely good and models with score \<0.3 are likely bad ones.
Models with score 0.3-0.4 are uncertain and should not be classified with
VoroMQA. On the other hand, ModFold is a meta-tool that provides you a very
detailed report (and parseable files) with local and global scores, but it can
take hours/days to obtain the result, so tend to use it only with selected
models.

::: callout-tip
## AlphaFold DB models

Models from the AlphaFold DB are appended to the available structures / models
if available. For these models we use the confidence values provided by
AlphaFold (pLDDT) rescaled to be between 0 and 1. Since both pLDDT and
QMEANDisCo are trained to predict lDDT (Cα-only for pLDDT and all-atom for
QMEANDisCo) and are displayed in the same range, they should be considered
comparable (From
[https://swissmodel.expasy.org/docs/repository_help](https://swissmodel.expasy.org/docs/repository_help%5D).)
:::

Another key parameter that you should know if you want to compare protein
structures is the **alpha carbon RMSD** (see [Structure alignment
section](ddbb.html#sec-alignment)). Any protein structural alignment will give
you this parameter as an estimation of the difference of the structures. You can
align structure with many online servers, like
[RCSB](https://www.rcsb.org/alignment), [FATCAT2](http://fatcat.godziklab.org/)
or using molecular visualization apps, including [Mol\*](https://molstar.org/),
[ChimeraX](https://www.cgl.ucsf.edu/chimerax/) or [PyMOL](https://pymol.org/2/)
(see [Protein Structure Display](pdb.html#sec-apps) section).

#### [Which model is better? Which regions are more difficult to model? Why?]{style="color:green"}

## Corollary: What can I do with my model and what I cannot? {#corollary}

[![Accuracy and application of protein structure models (in 2001). From
.](pics/sali.jpeg "Accuracy and application of protein structure models."){#fig-baker
.figure}](https://www.science.org/doi/10.1126/science.1065659)

A big power entails a big responsibility. The use of models entails a precaution
and a need for experimental validation. However, knowing the limitations of our
model is required for a realist use of it; and limits are defined by the model
quality.

The accuracy of a comparative model is related to the percentage sequence
identity on which it is based. High-accuracy comparative models can have about
1-2 Å root mean square (RMS) error for the main-chain atoms, which is comparable
to the accuracy of a nuclear magnetic resonance (NMR) structure or an x-ray
structure. These models can be used for functional studies and the prediction of
protein partners, including drugs or other proteins working in the same process.
Also, for some detailed studies, it would be convenient to refine your model by
[*Molecular Dynamics*](https://en.wikipedia.org/wiki/Molecular_dynamics) and
related methods towards a native-like structure. I suggest checking the review
by @adiyaman2019 on this topic.

On the contrary, low-accuracy comparative models are based on less than 20-30%
sequence identity, hindering the modeling capacity and accuracy. Some of these
models can be used for protein engineering purposes or to predict the function
of orphan sequences based on the protein fold (using
[Dali](http://ekhidna2.biocenter.helsinki.fi/dali/) or
[Foldseek](https://search.foldseek.com/search)).

As mentioned above, it also advisable to check the template structures and read
the papers describing them in order to squeeze all the information from your
model.

[**What do you think you could use our models of NEIL2 and
HAdV-2pol?**]{style="color:green"}

# [Homology Modeling Practice](https://www.evernote.com/shard/s62/sh/a1767902-113e-8dc9-b483-3e4857aa8466/1pCJlsPEm9OAAt5oURr5_ily0bXpYSQpivXMsAZe7AV8A-GO2Wcv7mWFOw){style="color:green"}