Topological Data Analysis performed on point clouds derived from crystallized structures and AlphaFold DB. This repository contains the code behind the work presented in the preprint The Topological Properties of the Protein Universe.
Reproducing the entire analysis on AlphaFold DB will require approximately 160 CPU cores, 1TB RAM, and around 46TB of storage.
The main analysis will take around 3 days.
The analysis was performed on Arm architecture (Ampere A1 Compute from Oracle).
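Before starting a full run, it may help to compare the machine against the figures quoted above. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the thresholds treat the approximate figures as hard minimums (an assumption), and the RAM query relies on `os.sysconf`, which is available on Linux.

```python
import os
import shutil

# Figures quoted in this README, treated here as minimums (an assumption,
# since the README states them as approximate).
REQUIRED_CORES = 160
REQUIRED_RAM_BYTES = 1 * 10**12     # 1 TB
REQUIRED_DISK_BYTES = 46 * 10**12   # 46 TB


def shortfalls(cores, ram_bytes, disk_bytes):
    """Return one message per requirement that the given resources do not meet."""
    problems = []
    if cores < REQUIRED_CORES:
        problems.append(f"CPU cores: have {cores}, need about {REQUIRED_CORES}")
    if ram_bytes < REQUIRED_RAM_BYTES:
        problems.append(f"RAM: have {ram_bytes / 1e12:.2f} TB, need about 1 TB")
    if disk_bytes < REQUIRED_DISK_BYTES:
        problems.append(f"Free disk: have {disk_bytes / 1e12:.2f} TB, need about 46 TB")
    return problems


def detect(path="."):
    """Read core count, physical RAM, and free disk space at `path` (Linux)."""
    cores = os.cpu_count() or 0
    ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    disk = shutil.disk_usage(path).free
    return cores, ram, disk


if __name__ == "__main__":
    for line in shortfalls(*detect()) or ["All resource checks passed."]:
        print(line)
```

Pass the intended storage location to `detect` (for example the directory that will hold the downloaded proteomes) so that the free-space check refers to the right volume.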
The top-level script for the analysis is data/alphafold/ripsererAFs.sh.
All proteomes from AlphaFold DB were downloaded according to the bulk download instructions to data/alphafold/dl/proteomes/ (details in data/alphafold/proteomes.sh).
Example test files are in data/alphafold/testfiles.
- Zsh for most top-level execution scripts.
- Julia for most of the main code. Cluster computations were run using v1.8.2 for Aarch64 Linux.
- Python and a Python package manager, e.g. pip or miniforge.
- R to reproduce visualizations.
- HDF5 for organizing analysis output data. Preferably with parallel support, see release_docs/INSTALL_parallel if downloading the source code.
- tmux for job control.
- PyMOL for superimposing structures. E.g. install free open source PyMOL for Linux, Mac, or Windows. With homebrew on Mac: brew install brewsci/bio/pymol
- Downloading all of AlphaFold DB structures was done with the Google Cloud SDK, see https://cloud.google.com/storage/docs/gsutil_install.
- Some results use a script from the submodule at tools/hyperTDA, which has its own install instructions.
- PH tool comparisons: follow installation instructions in relevant submodules under tools/ to reproduce benchmarking results. Ripser++ requires GPU hardware.
- ./install.sh for basic git setup.
- ./install.jl for Julia packages. Manifest.toml and Project.toml are also provided for acquiring the specific versions used.
- Python dependencies are listed in requirements.txt for community detection. E.g. pip install -r requirements.txt
- Some of the raw publicly available data can be downloaded by running data/RUNME.sh and similar files under data/.
- Postgres for the database work was installed according to data/alphafold/postgres/install.sh.
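
The tools listed above can be checked for presence on the PATH before running any of the scripts. This is a minimal sketch rather than something shipped with the repository: `missing_tools` is a hypothetical helper, and the executable names in `TOOLS` are assumptions about typical installations (for example `h5dump` for the HDF5 command-line tools and `psql` for the Postgres client), so adjust them to match your setup.

```python
import shutil

# Label -> executable name. The executable names are assumptions about
# typical installations and may differ on your system.
TOOLS = {
    "Zsh": "zsh",
    "Julia": "julia",
    "Python package manager": "pip",
    "R": "Rscript",
    "HDF5 command-line tools": "h5dump",
    "tmux": "tmux",
    "PyMOL": "pymol",
    "Google Cloud SDK": "gsutil",
    "Postgres client": "psql",
}


def missing_tools(tools, which=shutil.which):
    """Return the labels of tools whose executable is not found on PATH."""
    return [label for label, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is sufficient. Not every tool is needed for every step (PyMOL, for instance, is only used for superimposing structures), so a missing entry is not necessarily a blocker.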