Merge pull request #127 from neherlab/refine-docs
Refine docs
ivan-aksamentov authored Feb 11, 2025
2 parents 207646e + b000764 commit d15971f
Showing 17 changed files with 132 additions and 60 deletions.
18 changes: 10 additions & 8 deletions README.md
@@ -1,6 +1,6 @@
# pangraph
# PanGraph

[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://neherlab.github.io/pangraph/)
[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://docs.pangraph.org/)
![Docker Image Version (latest semver)](https://img.shields.io/docker/v/neherlab/pangraph?label=docker)
![Docker Pulls](https://img.shields.io/docker/pulls/neherlab/pangraph)

@@ -21,14 +21,13 @@ Each genome is then an ordered walk along _blocks_. The collection of all genome

Pangraph is available:
- as a **standalone binary**
- as a **conda package**
- as a **docker container**

For more extended instructions on installation please refer to the documentation.
For more extended instructions on installation please refer to the [documentation](https://docs.pangraph.org/category/installation).

### Standalone binary

### Conda package
This is the recommended way to install Pangraph. You can download the latest release for your operating system [from here](https://docs.pangraph.org/installation/standalone).

### Docker container

@@ -38,12 +37,12 @@ PanGraph is available as a Docker container:
docker pull neherlab/pangraph:latest
```

See the documentation for extended instructions on its usage.
See the [documentation](https://docs.pangraph.org/installation/with-docker) for extended instructions on its usage.


## Examples

Please refer to the tutorials within the documentation for an in-depth usage guide.
Please refer to the [tutorials within the documentation](https://docs.pangraph.org/category/tutorial) for an in-depth usage guide.
For a quick reference, see below.

Align a multi-fasta `sequences.fa` in a graph:
@@ -70,7 +69,7 @@ Reconstruct input sequences from the graph:

## PyPangraph

PyPangraph is a Python package with convenient utilities to load and explore the graph data structure. See the documentation for more details.
PyPangraph is a Python package with convenient utilities to load and explore the graph data structure. See the [documentation](https://docs.pangraph.org/pypangraph/about-pypangraph) for installation instructions and more examples.

```python
import pypangraph as pp
@@ -90,3 +89,6 @@ bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757
## License

[MIT License](LICENSE)

> [!NOTE]
> The legacy v0 version of Pangraph is now stored on the [`v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and legacy documentation is available [here](https://v0.docs.pangraph.org/).
17 changes: 15 additions & 2 deletions docs/docs/index.md
@@ -10,6 +10,9 @@ Pangraph is currently under heavy development. Bugs and crashes are to be expect

:::

**PanGraph** is a command-line tool for the analysis of bacterial genomes. It compresses multiple genomes into a compact **graph representation** that can be queried to extract information about the evolution of the genomes. It is developed and maintained by [the Neher lab](https://www.biozentrum.unibas.ch/about/administration/administration-a-z/overview/unit/research-group-richard-neher).

## Why Pangraph?

The content and structure of bacterial genomes evolves very rapidly:
Part of the genome can be cut out, duplicated, or inverted.
@@ -24,8 +27,18 @@ This is expected to be useful to parsimoniously infer horizontal gene transfer e

The resultant graph represents contiguous intervals of homologous DNA as vertices and every genome as an ordered walk across such vertices.
Edges of the graph are unordered and only exist if at least one genome was found to connect both vertices in either the forward or reverse strand.
For a more detailed description of the graph structure, see [what is a pangraph](/tutorial/tutorial_1#what-is-a-pangraph).
For a more detailed description of the graph structure, see [what is a pangraph](tutorial/t01-building-pangraph.md#what-is-a-pangraph).
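
To make the graph definition above concrete, here is a minimal Python sketch of how an unordered edge set follows from a collection of walks. It is not PanGraph's implementation: the function name `edges_from_paths` and the toy block names are hypothetical, each genome is given as an ordered list of block identifiers, and strand information is ignored because the edges are unordered.

```python
def edges_from_paths(paths, circular=False):
    """Return the set of unordered block pairs adjacent in at least one path."""
    edges = set()
    for blocks in paths.values():
        # Consecutive blocks along a walk are connected by an edge.
        pairs = list(zip(blocks, blocks[1:]))
        # A circular genome also connects its last block back to its first.
        if circular and len(blocks) > 1:
            pairs.append((blocks[-1], blocks[0]))
        for a, b in pairs:
            # Sorting the pair makes the edge independent of traversal direction.
            edges.add(tuple(sorted((a, b))))
    return edges


# Hypothetical toy input: two genomes sharing three blocks in different orders.
toy_paths = {
    "genome_1": ["A", "B", "C"],
    "genome_2": ["A", "C", "B"],
}
print(sorted(edges_from_paths(toy_paths)))
```

An edge appears as soon as a single genome traverses it, so the second genome adds the `A`-`C` edge even though the first genome never connects those two blocks directly.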

## Documentation outline

This documentation contains:
- a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph;
- [reference documentation](/reference) for the available commands;
- a Python library, [PyPangraph](/category/pypangraph), for analysis of the graph data structure.


This documentation is structured as a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph, along with a [reference documentation](/reference) of the available commands. In addition, we provide a python library [pyPanGraph](/category/pypangraph) for analysis of the graph data structure in Python.
:::info[Legacy Pangraph version]

This documentation refers to the latest version of pangraph. Code for the previous `v0` version is available on [the `v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and the legacy documentation is hosted at https://v0.docs.pangraph.org/.

:::
2 changes: 1 addition & 1 deletion docs/docs/pypangraph/_category_.json
@@ -3,6 +3,6 @@
"position": 4,
"link": {
"type": "generated-index",
"description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. "
"description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files."
}
}
43 changes: 43 additions & 0 deletions docs/docs/pypangraph/about-pypangraph.md
@@ -0,0 +1,43 @@
---
sidebar_position: 1
---

# About PyPangraph

PyPanGraph is a Python package to facilitate exploration and analysis of [PanGraph](https://github.com/neherlab/pangraph) output JSON files.

PyPangraph can be installed following [these instructions](installation.md).

Below are some simple usage examples of PyPangraph. For a more complete guide, you can follow the [tutorials](t01-load-graph.md).

```python
import pypangraph as pp

# load a graph
graph = pp.Pangraph.load_graph("graph.json")
# pangraph object with 15 paths, 137 blocks and 1042 nodes

# recover a specific path with its identifier
path = graph.paths["RCS48_p1"]
# path object | name = RCS48_p1, n. nodes = 60, length = 80596 bp

# extract a block alignment
block = graph.blocks[124231456905500231]
# block 124231456905500231, consensus len = 183 bp, n. nodes = 4

aln = block.to_biopython_alignment()
# Alignment with 15 rows and 2932 columns
# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
# ...

# get blocks statistics (length, copy number...)
stats_df = graph.to_blockstats_df()
# block_id count n_strains duplicated core len
# 124231456905500231 15 15 False True 2202
# 149501466629434994 2 2 False False 210
# 279570196774736738 2 2 False False 1308
# ... ... ... ... ... ...
```
8 changes: 6 additions & 2 deletions docs/docs/pypangraph/installation.md
@@ -1,13 +1,17 @@
---
sidebar_position: 1
sidebar_position: 2
---

# Installation
PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. It can be installed from [PyPI](https://pypi.org/) or Bioconda as

PyPanGraph can be installed from [PyPI](https://pypi.org/) or Bioconda as

```
pip install pypangraph
```

or

```
conda install -c bioconda pypangraph
```
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 1
sidebar_position: 3
---

# Loading and exploring a graph
@@ -18,7 +18,7 @@ print(graph)

## The components of a graph

As explained in the [tutorial](../tutorial/tutorial_1.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths.
As explained in the [tutorial](../tutorial/t01-building-pangraph.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths.

- **Blocks** encode multiple sequence alignments that group together homologous parts of the input genomes.
- **Paths** are representations of the input genomes as sequences of blocks or, more precisely, as sequences of **nodes**.
@@ -120,4 +120,4 @@ print(block.alignment.generate_alignment())
# '16194835320646696346': 'ATATATGGTGCGTTAATTTTTAAACCCT...'}
```

More details on alignments are provided in [tutorial 3](tutorial3.md).
More details on alignments are provided in [tutorial 3](t03-block-alignments.md).
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 2
sidebar_position: 4
---

# A look at the pangenome
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 3
sidebar_position: 5
---

# Exploring block alignments
@@ -44,7 +44,7 @@ AlignIO.write(aln, "aln.fa", "fasta")

:::info block alignment vs block sequences

As explained in the [Pangraph tutorial](../tutorial/tutorial_3.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph.
As explained in the [Pangraph tutorial](../tutorial/t03-exporting-sequences.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph.

If these insertions are important for your analysis, you can instead export **unaligned but complete** block sequences as biopython SeqRecord objects with:

@@ -72,7 +72,7 @@ AlignIO.write(aln, "aln.fa", "fasta")
Other than the alignment for single blocks, we can also extract the alignment of the full core genome, i.e. the concatenated alignment of all single-copy core blocks. Pangraph has a [dedicated export subcommand](../reference.md#pangraph-export-core-genome) for this:

```bash
pangraph export core-genome --guide-strain RCS34_p1 plasmids.json > core_aln.fa
pangraph export core-genome --guide-strain RCS34_p1 plasmids.json -o core_aln.fa
```

Alternatively pypangraph provides the following method:
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 4
sidebar_position: 6
---

# Paths and core genome synteny
@@ -68,7 +68,7 @@ plt.show()
From this plot we observe a strong conservation in the order of core blocks. This is even more explicit if we look at the graph in [Bandage](https://rrwick.github.io/Bandage/). We can use the export function of pangraph to export the graph in GFA format. By adding the `--no-duplicated` flag and the `--minimum-depth 15` option we can make sure that only core blocks are exported.

```bash
pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json > plasmids_core.gfa
pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json -o plasmids_core.gfa
```
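
As a rough illustration of what these two filters select, the Python sketch below applies the same idea to a toy set of paths. It is not pangraph's implementation: the helper `select_blocks` is hypothetical, and it assumes that the depth of a block is the number of paths in which it occurs.

```python
from collections import Counter


def select_blocks(paths, minimum_depth, no_duplicated=True):
    """Keep blocks found in at least `minimum_depth` paths, optionally dropping duplicated ones."""
    n_paths = Counter()   # number of paths in which each block occurs
    duplicated = set()    # blocks that occur more than once in some path
    for blocks in paths.values():
        for block, count in Counter(blocks).items():
            n_paths[block] += 1
            if count > 1:
                duplicated.add(block)
    return sorted(
        block
        for block, depth in n_paths.items()
        if depth >= minimum_depth and not (no_duplicated and block in duplicated)
    )


# Hypothetical toy input: three paths, block "c" is duplicated in p2.
toy = {"p1": ["a", "b", "c"], "p2": ["a", "c", "c"], "p3": ["a", "b"]}
print(select_blocks(toy, minimum_depth=3))
```

With the minimum depth set to the total number of paths, only blocks that are present exactly once in every path survive, which is the single-copy core set.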

Moreover we can save the block colors that we used in the previous plot in a csv file, that can be loaded by Bandage to color the blocks.
@@ -92,7 +92,7 @@ For these cases, pypangraph provides a method to quickly survey all changes in c

![minimal synteny units](../assets/pp_t4_minimal_synteny_units.png)

For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/tutorial_1.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. The minimal synteny units for this graph can be extracted with the function:
For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/t01-building-pangraph.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. The minimal synteny units for this graph can be extracted with the function:

```python
graph = pp.Pangraph.from_json("graph.json")
@@ -142,7 +142,8 @@ Similarly to what done for plasmids, we can visualize these units on Bandage. We
pangraph export gfa \
--no-duplicated \
--minimum-depth 10 \
graph.json > ecoli.gfa
-o ecoli.gfa \
graph.json
```

And then we can export the dictionary of core-block colors with:
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
---
sidebar_position: 5
sidebar_position: 7
---

# Comparing two genomes in a dotplot

Decomposing genomes in separate blocks provides a very good starting point for pairwise comparison between genomes. The pangenome graph can easily be used to draw a dotplot between two different paths, in which lines represent shared blocks.

In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella pneumoniae_ in [a previous tutorial](../tutorial/tutorial_4.md).
In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella pneumoniae_ in [a previous tutorial](../tutorial/t04-graph-projection.md).

PyPangraph provides a convenient `dotplot` function to generate such a dotplot:

Original file line number Diff line number Diff line change
@@ -26,10 +26,10 @@ More specifically, a path is encoded as a list of oriented block occurrences, i.

The tutorial requires you to have the `pangraph` command available in your path. Instructions on how to install pangraph can be found in [Installation](../category/installation).

For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded with the command:
For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded [from this link](https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz) or by running:

```bash
wget https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz
curl -L https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz -o ecoli.fa.gz
```

This is a single fasta file containing 10 fully assembled bacterial chromosomes, but no plasmids.
@@ -41,7 +41,7 @@ As a first step, we will build a pangraph object from the DNA of the 10 chromoso
This can be done using the command `build` (see [`build` command](../reference#pangraph-build)):

```bash
pangraph build -j 4 --circular ecoli.fa.gz > graph.json
pangraph build -j 4 --circular ecoli.fa.gz -o graph.json
```
- the option `--circular` is used when passing circular DNA sequences, like the bacterial chromosomes that we consider here.
- the option `-j 4` specifies the number of threads to use.
@@ -113,7 +113,7 @@ nodes = {
}
```

More details on the structure of this `json` file will be covered in the [next tutorial section](tutorial_2.md).
More details on the structure of this `json` file will be covered in the [next tutorial section](t02-pangraph-output-file.md).


### Sequence diversity and alignment sensitivity
@@ -144,7 +144,7 @@ As a first example, we consider exporting the pangraph in [Graphical Fragment As
pangraph export gfa \
--no-duplicated \
graph.json \
> graph.gfa
-o graph.gfa
```

This will create a `graph.gfa` file, which can be visualized using [Bandage](https://rrwick.github.io/Bandage/).
@@ -161,9 +161,9 @@ pangraph export gfa \
--minimum-depth=10 \
--include-sequences \
graph.json \
> graph_core.gfa
-o graph_core.gfa
```

The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/tutorial4.md).
The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/t04-core-synteny.md).

![img](./../assets/t1_gfa_core.png)
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ As an example, we will use snippets from the `graph.json` file that was produced

## The structure of `graph.json`

As discussed in the [previous tutorial section](./tutorial_1.md#what-is-a-pangraph), the three main entries of the pangraph output file are `paths`, `blocks` and `nodes`.
As discussed in the [previous tutorial section](./t01-building-pangraph.md#what-is-a-pangraph), the three main entries of the pangraph output file are `paths`, `blocks` and `nodes`.

- each entry in the `paths` list encodes one of the nucleotide sequences that were given as input to the `build` command, represented as a list of nodes (i.e. particular instances of a block)
- each entry in the `blocks` list represents an alignable set of homologous sequences. A block contains the consensus of all of these sequences, together with information to reconstruct the full alignment. Each entry in the alignment is represented by a `node`.
@@ -141,7 +141,7 @@ Below is a schematic representation of how these variations are applied to the c

![img](./../assets/t2_alignment_reconstruction.png)

As discussed in the [next section](./tutorial_3.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways:
As discussed in the [next section](./t03-exporting-sequences.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways:
- as **node sequences**. In this case sequences are not aligned, but each entry corresponds to the exact sequence of a node, with all variations applied.
- as a **multiple sequence alignment**. In this case sequences are aligned, but insertions are omitted.
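
The difference between the two reconstructions can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than pangraph's code: the function `reconstruct` is hypothetical, and so is the edit encoding, with substitutions given as `(position, base)`, deletions as `(position, length)` and insertions as `(position, sequence)`, the inserted sequence being placed before the consensus position. The actual layout of the `alignments` entries may differ.

```python
def reconstruct(consensus, subs=(), dels=(), inss=()):
    """Return (aligned, full): the alignment row and the complete node sequence."""
    columns = list(consensus)
    # Substitutions replace a single consensus base.
    for pos, alt in subs:
        columns[pos] = alt
    # Deletions become gap characters in consensus coordinates.
    for pos, length in dels:
        for i in range(pos, pos + length):
            columns[i] = "-"
    # The alignment row omits insertions, so it keeps the consensus length.
    aligned = "".join(columns)

    # The node sequence also carries insertions and drops the gaps.
    pieces = columns + [""]  # extra slot for an insertion after the last base
    for pos, seq in inss:
        pieces[pos] = seq + pieces[pos]
    full = "".join(pieces).replace("-", "")
    return aligned, full


# Hypothetical toy example.
aligned, full = reconstruct(
    "ACGTACGT", subs=[(1, "T")], dels=[(4, 2)], inss=[(3, "GG")]
)
print(aligned)
print(full)
```

The aligned row always has the length of the consensus, which is what allows rows from different nodes to be stacked into an alignment. The full sequence can be longer or shorter, depending on the inserted and deleted bases.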

Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@ Block consensus sequences can be exported using the [`export block-consensus` su
```bash
pangraph export block-consensus \
graph.json \
> block_cons.fa
-o block_cons.fa
```

This generates the `block_cons.fa` FASTA file. This file contains one entry per block, with the block ID as the header and the consensus sequence as the sequence:
@@ -60,7 +60,7 @@ ATTCATGTCCTTGACTGCTTTGTTAATGTCGCACTGGA...

The FASTA id of each entry is the node id, while the description contains a json string with additional information: the path name, block id, start and end positions of the node, and strandedness.
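
Since the description is plain JSON, a header line can be split with the Python standard library alone. The sketch below is a hedged example: the helper `parse_header` is hypothetical, and the key names in the sample header are placeholders, so check an exported file for the actual keys.

```python
import json


def parse_header(line):
    """Split a FASTA header into the record id and its JSON-encoded description."""
    # The id runs up to the first space; everything after it is the description.
    record_id, _, description = line.lstrip(">").strip().partition(" ")
    info = json.loads(description) if description else {}
    return record_id, info


# Hypothetical header; the key names are placeholders, not pangraph's.
header = '>1234 {"path_name": "NC_010468", "strand": "+"}'
node_id, info = parse_header(header)
print(node_id, info)
```

Splitting only at the first space matters because the JSON string itself may contain spaces.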

Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./tutorial_2#how-alignments-are-encoded)). However pangraph also provides the option to export complete, _but unaligned_, sequences for each block:
Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./t02-pangraph-output-file.md#how-alignments-are-encoded)). However pangraph also provides the option to export complete, _but unaligned_, sequences for each block:

```bash
pangraph export block-sequences \
@@ -80,7 +80,7 @@ Pangraph also provides a quick command to extract the core-genome alignment of t
pangraph export core-genome \
graph.json \
--guide-strain NC_010468 \
> core_genome_aln.fa
-o core_genome_aln.fa
```

:::note