Skip to content

Commit

Permalink
Add tutorial for structural alphabets
Browse files Browse the repository at this point in the history
  • Loading branch information
padix-key committed Nov 2, 2024
1 parent ebb7b36 commit f135b31
Show file tree
Hide file tree
Showing 3 changed files with 161 additions and 1 deletion.
5 changes: 4 additions & 1 deletion doc/tutorial/sequence/align_multiple.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,16 @@

Guide trees and multiple sequence alignments
============================================

.. currentmodule:: biotite.sequence.align

In :doc:`a previous chapter <align_optimal>` we have learned how to align *two*
sequences with each other.
However, for many use cases we require an alignment of more than two
sequences.
For example, one such use case is the analysis of homologous regions within a
protein family.
Although the *dynamic programming* algorithm behind :doc:`align_optimal` can in
Although the *dynamic programming* algorithm behind :func:`align_optimal()` can in
theory be extended to any number of sequences, the computation time scales
linearly with the length *each* aligned sequence.
Thus, the method becomes infeasible for already a few sequences.
Expand Down
156 changes: 156 additions & 0 deletions doc/tutorial/structure/alphabet.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
.. include:: /tutorial/preamble.rst

Structural alphabets
====================

.. currentmodule:: biotite.structure.alphabet

In the previous chapters we have seen the multitude of methods, than can be applied
to sequence and structure data.
*Structural alphabets* combine the best of both worlds:
A structural alphabet is a representation of protein or nucleic acid structures,
where each residue is encoded into a single character, based on the local geometry
or contact partners of that residue.
This way the high performance sequence-based methods, for example alignment searches,
can be applied to structure data.

The :mod:`biotite.structure.alphabet` subpackage provides multiple structural alphabets,
but for the scope of this tutorial we will focus on the *3Di* alphabet, popularized by
the `Foldseek <https://github.com/steineggerlab/foldseek>`_ software for fast protein
structure comparison.
Just keep in mind that in the following examples the underlying structural alphabet
can be substituted with minimal modifications.

Converting structures to sequences
-----------------------------------

.. jupyter-execute::

import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.alphabet as strucalph
import biotite.structure.io.pdbx as pdbx

# You can actually directly read the downloaded PDBx file content
# without an intermediate file
pdbx_file = pdbx.BinaryCIFFile.read(rcsb.fetch("2ZVS", "bcif"))
ec_ferredoxin = pdbx.get_structure(pdbx_file, model=1)
# This structural alphabet expects a peptide chain
ec_ferredoxin = ec_ferredoxin[struc.filter_amino_acids(ec_ferredoxin)]
structural_sequences, chain_starts = strucalph.to_3di(ec_ferredoxin)
print(structural_sequences)
print(chain_starts)

.. jupyter-execute::

ec_ferredoxin_3di = structural_sequences[0]
print(ec_ferredoxin_3di)

Each symbol in the sequence corresponds to one residue in the structure,
that can be extracted using the residue-level functionality of
:mod:`biotite.structure`.


Sequence alignments on structural alphabets
-------------------------------------------

As mentioned in the :doc:`sequence chapter <../sequence/encoding>`, the sequence
based methods in :mod:`biotite.sequence` generally do not care about the type of
sequence.
This means that we can use any method we have learned so far on the structural
sequences.
For the scope of this tutorial we will merely use :func:`align_optimal()` to find
corresponding residues in two structures.
As structure is much better conserved than its sequence, the alignment will even work
on remote homologs with low sequence identity, where a classical sequence alignment
would fail.

.. jupyter-execute::

import matplotlib.pyplot as plt
import biotite.sequence.align as align
import biotite.sequence.graphics as graphics

matrix = align.SubstitutionMatrix.std_3di_matrix()
alignment = align.align_optimal(
ec_ferredoxin_3di,
st_ferredoxin_3di,
matrix,
gap_penalty=(-10, -1),
terminal_penalty=False,
)[0]

fig, ax = plt.subplots(figsize=(8.0, 2.0))
graphics.plot_alignment_similarity_based(
ax, alignment, matrix=matrix, labels=["EC", "ST"], symbols_per_line=50
)

Example: Superimposing structures
---------------------------------

Now that we know the aligned residues from the alignment as *anchors* to superimpose
the structures.

.. jupyter-execute::

def rmsd_from_alignment(fixed, mobile, alignment):
"""
A very simple function that extracts corresponding residues (the 'anchors') from
an alignment and uses them to run a superimposition.
Finally the RMSD on the superimposed structures plus the number of anchors is
returned.
"""
alignment_codes = align.get_codes(alignment)
anchor_mask = (
# Anchors must be structurally similar
(matrix.score_matrix()[alignment_codes[0], alignment_codes[1]] > 0)
# Gaps are not anchors
& (alignment_codes != -1).all(axis=0)
)
superimposition_anchors = alignment.trace[anchor_mask]
# Each anchor corresponds to a residue
# Use the CA atoms as representative for each residue
fixed_ca = fixed[fixed.atom_name == "CA"]
mobile_ca = mobile[mobile.atom_name == "CA"]
fixed_anchors = fixed_ca[superimposition_anchors[:, 0]]
mobile_anchors = mobile_ca[superimposition_anchors[:, 1]]

mobile_anchors, transform = struc.superimpose(
fixed_anchors,
mobile_anchors,
)
return struc.rmsd(fixed_anchors, mobile_anchors), len(superimposition_anchors)

rmsd, n_anchors = rmsd_from_alignment(ec_ferredoxin, st_ferredoxin, alignment)
print("Number of corresponding residues found:", n_anchors)
print("RMSD:", rmsd)

Again, with a classical amino acid sequence based approach the accuracy of the
superposition would be much lower:
In this case less corresponding residues can be found from the the amino sequence
alignment and the RMSD between them is significantly higher.

.. jupyter-execute::

matrix = align.SubstitutionMatrix.std_protein_matrix()
ec_ferredoxin_seq = struc.to_sequence(ec_ferredoxin)[0][0]
st_ferredoxin_seq = struc.to_sequence(st_ferredoxin)[0][0]
alignment = align.align_optimal(
ec_ferredoxin_seq,
st_ferredoxin_seq,
matrix,
gap_penalty=(-10, -1),
terminal_penalty=False,
)[0]
rmsd, n_anchors = rmsd_from_alignment(ec_ferredoxin, st_ferredoxin, alignment)
print("Number of corresponding residues found:", n_anchors)
print("RMSD:", rmsd)

fig, ax = plt.subplots(figsize=(8.0, 2.0))
graphics.plot_alignment_similarity_based(
ax, alignment, matrix=matrix, labels=["EC", "ST"], symbols_per_line=50
)

This shows only a small fraction of the power of structural alphabets.
They can also be used to find structural homologs in a large database, to superimpose
multiple structures at once and much more!
1 change: 1 addition & 0 deletions doc/tutorial/structure/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -54,3 +54,4 @@ contains functions for structure analysis and manipulation.
measurement
segments
nucleotide
alphabet

0 comments on commit f135b31

Please sign in to comment.