Merge pull request #127 from neherlab/refine-docs

Refine docs
neherlab · Feb 11, 2025 · d15971f
2 parents 207646e + b000764
commit d15971f
Showing 17 changed files with 132 additions and 60 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# pangraph
+# PanGraph
 
-[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://neherlab.github.io/pangraph/)
+[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://docs.pangraph.org/)
 ![Docker Image Version (latest semver)](https://img.shields.io/docker/v/neherlab/pangraph?label=docker)
 ![Docker Pulls](https://img.shields.io/docker/pulls/neherlab/pangraph)
 
@@ -21,14 +21,13 @@ Each genome is then an ordered walk along _blocks_. The collection of all genome
 
 Pangraph is available:
 - as a **standalone binary**
-- as a **conda package**
 - as a **docker container**
 
-For more extended instructions on installation please refer to the documentation.
+For more extended instructions on installation please refer to the [documentation](https://docs.pangraph.org/category/installation).
 
 ### Standalone binary
 
-### Conda package
+This is the recommended way to install Pangraph. You can download the latest release for your operating system [from here](https://docs.pangraph.org/installation/standalone).
 
 ### Docker container
 
@@ -38,12 +37,12 @@ PanGraph is available as a Docker container:
     docker pull neherlab/pangraph:latest
 ```
 
-See the documentation for extended instuctions on its usage.
+See the [documentation](https://docs.pangraph.org/installation/with-docker) for extended instuctions on its usage.
 
 
 ## Examples
 
-Please refer to the tutorials within the documentation for an in-depth usage guide.
+Please refer to the [tutorials within the documentation](https://docs.pangraph.org/category/tutorial) for an in-depth usage guide.
 For a quick reference, see below.
 
 Align a multi-fasta `sequences.fa` in a graph:
@@ -70,7 +69,7 @@ Reconstruct input sequences from the graph:
 
 ## PyPangraph
 
-PyPangraph is a python package with convenient utilities to load and explore the graph data structure, see the documentation for more details.
+PyPangraph is a python package with convenient utilities to load and explore the graph data structure, see the [documentation](https://docs.pangraph.org/pypangraph/about-pypangraph) for installation instructions and more examples.
 
 ```python
 import pypangraph as pp
@@ -90,3 +89,6 @@ bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757
 ## License
 
 [MIT License](LICENSE)
+
+> [!NOTE]  
+> The legacy v0 version of Pangraph is now stored on the [`v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and legacy documentation is available [here](https://v0.docs.pangraph.org/).
diff --git a/docs/docs/index.md b/docs/docs/index.md
@@ -10,6 +10,9 @@ Pangraph is currently under heavy development. Bugs and crashes are to be expect
 
 :::
 
+**PanGraph** is a command-line tool for the analysis of bacterial genomes. It compresses multiple genome in a compact **graph representation**, that can be queried to extract information about the evolution of the genomes. It is developed and maintained by [the Neher lab](https://www.biozentrum.unibas.ch/about/administration/administration-a-z/overview/unit/research-group-richard-neher).
+
+## Why Pangraph?
 
 The content and structure of bacterial genomes evolves very rapidly:
 Part of the genome can be cut out, duplicated, or inverted.
@@ -24,8 +27,18 @@ This is expected to be useful to parsimoniously infer horizontal gene transfer e
 
 The resultant graph represents contiguous intervals of homologous DNA as vertices and every genome as an ordered walk across such vertices.
 Edges of the graph are unordered and only exist if at least one genome was found to connect both vertices in either the forward or reverse strand.
-For a more detailed description of the graph structure, see [what is a pangraph](/tutorial/tutorial_1#what-is-a-pangraph).
+For a more detailed description of the graph structure, see [what is a pangraph](tutorial/t01-building-pangraph.md#what-is-a-pangraph).
+
+## Documentation outline
+
+This documentation contains: 
+- a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph.
+- a [reference documentation](/reference) of the available commands.
+- in addition, we provide a python library [PyPangraph](/category/pypangraph) for analysis of the graph data structure in Python
+
 
-This documentation is structures as a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph, along with a [reference documentation](/reference) of the available commands. In addition, we provide a python library [pyPanGraph](/category/pypangraph) for analysis of the graph data structure in Python.
+:::info[Legacy Pangraph version]
 
+    This documentation refers to the latest version of pangraph. Code for the previous `v0` version is available on [the `v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and the legacy documentation is hosted at https://v0.docs.pangraph.org/.
 
+:::
diff --git a/docs/docs/pypangraph/_category_.json b/docs/docs/pypangraph/_category_.json
@@ -3,6 +3,6 @@
   "position": 4,
   "link": {
     "type": "generated-index",
-    "description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. "
+    "description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files."
   }
 }
diff --git a/docs/docs/pypangraph/about-pypangraph.md b/docs/docs/pypangraph/about-pypangraph.md
@@ -0,0 +1,43 @@
+---
+sidebar_position: 1
+---
+
+# About PyPangraph
+
+PyPanGraph is a Python package to facilitate exploration and analysis of [PanGraph](https://github.com/neherlab/pangraph) output JSON files.
+
+PyPangraph can be installed following [these instructions](installation.md).
+
+Below you'll find some simple usage of PyPangraph. For a more complete guide you can follow the [tutorials](t01-load-graph.md).
+
+```python
+import pypangraph as pp
+
+# load a graph
+graph = pp.Pangraph.load_graph("graph.json")
+# pangraph object with 15 paths, 137 blocks and 1042 nodes
+
+# recover a specific path with its identifier
+path = graph.paths["RCS48_p1"]
+# path object | name = RCS48_p1, n. nodes = 60, length = 80596 bp
+
+# extract a block alignment
+block = graph.blocks[124231456905500231]
+# block 124231456905500231, consensus len = 183 bp, n. nodes = 4
+
+aln = block.to_biopython_alignment()
+# Alignment with 15 rows and 2932 columns
+# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
+# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
+# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
+# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA...
+# ...
+
+# get blocks statistics (length, copy number...)
+stats_df = graph.to_blockstats_df()
+# block_id              count  n_strains  duplicated   core   len
+# 124231456905500231       15         15       False   True  2202
+# 149501466629434994        2          2       False  False   210
+# 279570196774736738        2          2       False  False  1308
+# ...                     ...        ...         ...    ...   ...
+```
diff --git a/docs/docs/pypangraph/installation.md b/docs/docs/pypangraph/installation.md
@@ -1,13 +1,17 @@
 ---
-sidebar_position: 1
+sidebar_position: 2
 ---
 
 # Installation
-PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. It can be installed from [PyPi](https://pypi.org/) or Bioconda as
+
+PyPanGraph can be installed from [PyPi](https://pypi.org/) or Bioconda as
+
 ```
 pip install pypangraph
 ```
+
 or
+
 ```
 conda install -c bioconda pypangraph
 ```

diff --git a/docs/docs/pypangraph/tutorial1.md b/docs/docs/pypangraph/t01-load-graph.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 1
+sidebar_position: 3
 ---
 
 # Loading and exploring a graph
@@ -18,7 +18,7 @@ print(graph)
 
 ## The components of a graph
 
-As explained in the [tutorial](../tutorial/tutorial_1.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths.
+As explained in the [tutorial](../tutorial/t01-building-pangraph.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths.
 
 - **Blocks** encode multiple sequence alignments that group together homologous parts of the input genomes.
 - **Paths** are representation of the input genomes as a sequences of blocks. More precisely, as sequence of **nodes**.
@@ -120,4 +120,4 @@ print(block.alignment.generate_alignment())
 #  '16194835320646696346': 'ATATATGGTGCGTTAATTTTTAAACCCT...'}
 ```
 
-More details on alignments are provided in [tutorial 3](tutorial3.md).
+More details on alignments are provided in [tutorial 3](t03-block-alignments.md).
diff --git a/docs/docs/pypangraph/tutorial2.md b/docs/docs/pypangraph/t02-pangenome.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 4
 ---
 
 # A look at the pangenome

diff --git a/docs/docs/pypangraph/tutorial3.md b/docs/docs/pypangraph/t03-block-alignments.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 5
 ---
 
 # Exploring block alignments
@@ -44,7 +44,7 @@ AlignIO.write(aln, "aln.fa", "fasta")
 
 :::info block alignment vs block sequences
 
-    As explained in [Pangraph tutorial](../tutorial/tutorial_3.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph.
+    As explained in [Pangraph tutorial](../tutorial/t03-exporting-sequences.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph.
 
     If these insertions are important for your analysis, you can instead export **unaligned but complete** block sequences as biopython SeqRecord objects with:
 
@@ -72,7 +72,7 @@ AlignIO.write(aln, "aln.fa", "fasta")
 Other than the alignment for single blocks, we can also extract the alignment of the full core genome, i.e. the concatenated alignment of all single-copy core blocks. Pangraph has a [dedicated export subcommand](../reference.md#pangraph-export-core-genome) for this:
 
 ```bash
-pangraph export core-genome --guide-strain RCS34_p1 plasmids.json > core_aln.fa
+pangraph export core-genome --guide-strain RCS34_p1 plasmids.json -o core_aln.fa
 ```
 
 Alternatively pypangraph provides the following method:

diff --git a/docs/docs/pypangraph/tutorial4.md b/docs/docs/pypangraph/t04-core-synteny.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 6
 ---
 
 # Paths and core genome synteny
@@ -68,7 +68,7 @@ plt.show()
 From this plot we observe a strong conservation in the order of core blocks. This is even more explicit if we look at the graph in [Bandage](https://rrwick.github.io/Bandage/). We can use the export function of pangraph to export the graph in GFA format. By adding the `--no-duplicated` flag and the `--minimum-depth 15` option we can make sure that only core blocks are exported.
 
 ```bash
-pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json > plasmids_core.gfa
+pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json -o plasmids_core.gfa
 ```
 
 Moreover we can save the block colors that we used in the previous plot in a csv file, that can be loaded by Bandage to color the blocks.
@@ -92,7 +92,7 @@ For these cases, pypangraph provides a method to quickly survey all changes in c
 
 ![minimal synteny units](../assets/pp_t4_minimal_synteny_units.png)
 
-For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/tutorial_1.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. The minimal sinteny units for this graph can be extracted with the function:
+For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/t01-building-pangraph.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. The minimal sinteny units for this graph can be extracted with the function:
 
 ```python
 graph = pp.Pangraph.from_json("graph.json")
@@ -142,7 +142,8 @@ Similarly to what done for plasmids, we can visualize these units on Bandage. We
 pangraph export gfa \
     --no-duplicated \
     --minimum-depth 10 \
-    graph.json > ecoli.gfa
+    -o ecoli.gfa \
+    graph.json 
 ```
 
 And then we can export the dictionary of core-block colors with:

diff --git a/docs/docs/pypangraph/tutorial5.md b/docs/docs/pypangraph/t05-dotplot.md
@@ -1,12 +1,12 @@
 ---
-sidebar_position: 5
+sidebar_position: 7
 ---
 
 # Comparing two genomes in a dotplot
 
 Decomposing genomes in separate blocks provides a very good starting point for pairwise comparison between genomes. The pangenome graph can easily be used to draw a dotplot between two different paths, in which lines represent shared blocks.
 
-In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella Pneumoniae_ in [a previous tutorial](../tutorial/tutorial_4.md).
+In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella Pneumoniae_ in [a previous tutorial](../tutorial/t04-graph-projection.md).
 
 PyPangraph provides a convenient `dotplot` function to generate such a dotplot:
 

diff --git a/docs/docs/tutorial/tutorial_1.md b/docs/docs/tutorial/t01-building-pangraph.md
@@ -26,10 +26,10 @@ More specifically, a path is encoded as a list of oriented block occurrences, i.
 
 The tutorial requires you to have the `pangraph` command available in your path. Instructions on how to install pangraph can be found in [Installation](../category/installation).
 
-For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia Coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded with the command:
+For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia Coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded [from this link](https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz) or by running:
 
 ```bash
-wget https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz
+curl https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz -o ecoli.fa.gz
 ```
 
 This is a single fasta file containing 10 fully assembled bacterial chromosomes, but no plasmids.
@@ -41,7 +41,7 @@ As a first step, we will build a pangraph object from the DNA of the 10 chromoso
 This can be done using the command `build` (see [`build` command](../reference#pangraph-build)):
 
 ```bash
-pangraph build -j 4 --circular ecoli.fa.gz > graph.json
+pangraph build -j 4 --circular ecoli.fa.gz -o graph.json
 ```
 - the option `--circular` is used when passing circular DNA sequences, like the bacterial chromosomes that we consider here.
 - the option `-j 4` specifies the number of threads to use.
@@ -113,7 +113,7 @@ nodes = {
 }
 ```
 
-More details on the structure of this `json` file will be covered in the [next tutorial section](tutorial_2.md).
+More details on the structure of this `json` file will be covered in the [next tutorial section](t02-pangraph-output-file.md).
 
 
 ### Sequence diversity and alignment sensitivity
@@ -144,7 +144,7 @@ As a first example, we consider exporting the pangraph in [Graphical Fragment As
 pangraph export gfa \
     --no-duplicated \
     graph.json \
-    > graph.gfa
+    -o graph.gfa
 ```
 
 This will create a `graph.gfa` file, which can be visualized using [Bandage](https://rrwick.github.io/Bandage/).
@@ -161,9 +161,9 @@ pangraph export gfa \
     --minimum-depth=10 \
     --include-sequences \
     graph.json \
-    > graph_core.gfa
+    -o graph_core.gfa
 ```
 
-The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/tutorial4.md).
+The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/t04-core-synteny.md).
 
 ![img](./../assets/t1_gfa_core.png)
diff --git a/docs/docs/tutorial/tutorial_2.md b/...docs/tutorial/t02-pangraph-output-file.md
@@ -10,7 +10,7 @@ As an example, we will use snippets from the `graph.json` file that was produced
 
 ## The structure of `graph.json`
 
-As discussed in the [previous tutorial section](./tutorial_1.md#what-is-a-pangraph), the three main entries of pangraph output file are `paths`, `blocks` and `nodes`.
+As discussed in the [previous tutorial section](./t01-building-pangraph.md#what-is-a-pangraph), the three main entries of pangraph output file are `paths`, `blocks` and `nodes`.
 
 - each entry in the `paths` list encodes one of the nucleotide sequences that were given as input to the `build` command, represented as a list of nodes (i.e. particular instances of a block)
 - each entry in the `blocks` list represents an alignable set of homologous sequences. A block contains the consensus of all of these sequences, together with information to reconstruct the full alignment. Each entry in the alignment is represented by a `node`.
@@ -141,7 +141,7 @@ Below is a schematic representation of how these variations are applied to the c
 
 ![img](./../assets/t2_alignment_reconstruction.png)
 
-As discussed in the [next section](./tutorial_3.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways:
+As discussed in the [next section](./t03-exporting-sequences.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways:
 - as **node sequences**. In this case sequences are not aligned, but each entry corresponds to the exact sequence of a node, with all variations applied.
 - as a **multiple sequence alignment**. In this case sequences are aligned, but insertions are omitted.
 

diff --git a/docs/docs/tutorial/tutorial_3.md b/.../docs/tutorial/t03-exporting-sequences.md
@@ -16,7 +16,7 @@ Block consensus sequences can be exported using the [`export block-consensus` su
 ```bash
 pangraph export block-consensus \
     graph.json \
-    > block_cons.fa
+    -o block_cons.fa
 ```
 
 This generates the `block_cons.fa` FASTA file. This file contains one entry per block, with the block ID as the header and the consensus sequence as the sequence:
@@ -60,7 +60,7 @@ ATTCATGTCCTTGACTGCTTTGTTAATGTCGCACTGGA...
 
 The FASTA id of each entry is the node id, while the description contains a json string with additional information: the path name, block id, start and end positions of the node, and strandedness.
 
-Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./tutorial_2#how-alignments-are-encoded)). However pangraph also provides the option to export complete, _but unaligned_, sequences for each block:
+Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./t02-pangraph-output-file.md#how-alignments-are-encoded)). However pangraph also provides the option to export complete, _but unaligned_, sequences for each block:
 
 ```bash
 pangraph export block-sequences \
@@ -80,7 +80,7 @@ Pangraph also provides a quick command to extract the core-genome alignment of t
 pangraph export core-genome \
     graph.json \
     --guide-strain NC_010468 \
-    > core_genome_aln.fa
+   -o core_genome_aln.fa
 ```
 
 :::note