diff --git a/README.md b/README.md index 83e88564..11b37440 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -# pangraph +# PanGraph -[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://neherlab.github.io/pangraph/) +[![Documentation](https://img.shields.io/badge/Documentation-Link-blue.svg)](https://docs.pangraph.org/) ![Docker Image Version (latest semver)](https://img.shields.io/docker/v/neherlab/pangraph?label=docker) ![Docker Pulls](https://img.shields.io/docker/pulls/neherlab/pangraph) @@ -21,14 +21,13 @@ Each genome is then an ordered walk along _blocks_. The collection of all genome Pangraph is available: - as a **standalone binary** -- as a **conda package** - as a **docker container** -For more extended instructions on installation please refer to the documentation. +For more extended instructions on installation please refer to the [documentation](https://docs.pangraph.org/category/installation). ### Standalone binary -### Conda package +This is the recommended way to install Pangraph. You can download the latest release for your operating system [from here](https://docs.pangraph.org/installation/standalone). ### Docker container @@ -38,12 +37,12 @@ PanGraph is available as a Docker container: docker pull neherlab/pangraph:latest ``` -See the documentation for extended instuctions on its usage. +See the [documentation](https://docs.pangraph.org/installation/with-docker) for extended instructions on its usage. ## Examples -Please refer to the tutorials within the documentation for an in-depth usage guide. +Please refer to the [tutorials within the documentation](https://docs.pangraph.org/category/tutorial) for an in-depth usage guide. For a quick reference, see below. 
Align a multi-fasta `sequences.fa` in a graph: @@ -70,7 +69,7 @@ Reconstruct input sequences from the graph: ## PyPangraph -PyPangraph is a python package with convenient utilities to load and explore the graph data structure, see the documentation for more details. +PyPangraph is a Python package with convenient utilities to load and explore the graph data structure; see the [documentation](https://docs.pangraph.org/pypangraph/about-pypangraph) for installation instructions and more examples. ```python import pypangraph as pp @@ -90,3 +89,6 @@ bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757 ## License [MIT License](LICENSE) + +> [!NOTE] +> The legacy v0 version of Pangraph is now stored on the [`v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and legacy documentation is available [here](https://v0.docs.pangraph.org/). \ No newline at end of file diff --git a/docs/docs/index.md b/docs/docs/index.md index 97e76ab0..b164b8c3 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -10,6 +10,9 @@ Pangraph is currently under heavy development. Bugs and crashes are to be expect ::: +**PanGraph** is a command-line tool for the analysis of bacterial genomes. It compresses multiple genomes into a compact **graph representation** that can be queried to extract information about the evolution of the genomes. It is developed and maintained by [the Neher lab](https://www.biozentrum.unibas.ch/about/administration/administration-a-z/overview/unit/research-group-richard-neher). + +## Why Pangraph? The content and structure of bacterial genomes evolves very rapidly: Part of the genome can be cut out, duplicated, or inverted. @@ -24,8 +27,18 @@ This is expected to be useful to parsimoniously infer horizontal gene transfer e The resultant graph represents contiguous intervals of homologous DNA as vertices and every genome as an ordered walk across such vertices. 
Edges of the graph are unordered and only exist if at least one genome was found to connect both vertices in either the forward or reverse strand. -For a more detailed description of the graph structure, see [what is a pangraph](/tutorial/tutorial_1#what-is-a-pangraph). +For a more detailed description of the graph structure, see [what is a pangraph](tutorial/t01-building-pangraph.md#what-is-a-pangraph). + +## Documentation outline + +This documentation contains: +- a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph. +- a [reference documentation](/reference) of the available commands. +- in addition, we provide a python library [PyPangraph](/category/pypangraph) for analysis of the graph data structure in Python + -This documentation is structures as a [set of tutorials](/category/tutorial) that explain the essential steps to build and manipulate a graph, along with a [reference documentation](/reference) of the available commands. In addition, we provide a python library [pyPanGraph](/category/pypangraph) for analysis of the graph data structure in Python. +:::info[Legacy Pangraph version] + This documentation refers to the latest version of pangraph. Code for the previous `v0` version is available on [the `v0` branch](https://github.com/neherlab/pangraph/tree/v0) of the repository, and the legacy documentation is hosted at https://v0.docs.pangraph.org/. +::: \ No newline at end of file diff --git a/docs/docs/pypangraph/_category_.json b/docs/docs/pypangraph/_category_.json index c6af3407..e6ba2aef 100644 --- a/docs/docs/pypangraph/_category_.json +++ b/docs/docs/pypangraph/_category_.json @@ -3,6 +3,6 @@ "position": 4, "link": { "type": "generated-index", - "description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. " + "description": "PyPanGraph is a Python package to facilitate analysis of pangraph JSON files." 
} } \ No newline at end of file diff --git a/docs/docs/pypangraph/about-pypangraph.md b/docs/docs/pypangraph/about-pypangraph.md new file mode 100644 index 00000000..70e0be84 --- /dev/null +++ b/docs/docs/pypangraph/about-pypangraph.md @@ -0,0 +1,43 @@ +--- +sidebar_position: 1 +--- + +# About PyPangraph + +PyPanGraph is a Python package to facilitate exploration and analysis of [PanGraph](https://github.com/neherlab/pangraph) output JSON files. + +PyPangraph can be installed following [these instructions](installation.md). + +Below you'll find some simple usage of PyPangraph. For a more complete guide you can follow the [tutorials](t01-load-graph.md). + +```python +import pypangraph as pp + +# load a graph +graph = pp.Pangraph.load_graph("graph.json") +# pangraph object with 15 paths, 137 blocks and 1042 nodes + +# recover a specific path with its identifier +path = graph.paths["RCS48_p1"] +# path object | name = RCS48_p1, n. nodes = 60, length = 80596 bp + +# extract a block alignment +block = graph.blocks[124231456905500231] +# block 124231456905500231, consensus len = 183 bp, n. nodes = 4 + +aln = block.to_biopython_alignment() +# Alignment with 15 rows and 2932 columns +# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA... +# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA... +# TTCTGTAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA... +# TTCTGCAATTGAGTCTTGTATGCCCCCATAACAGCACTAAATAA... +# ... + +# get blocks statistics (length, copy number...) +stats_df = graph.to_blockstats_df() +# block_id count n_strains duplicated core len +# 124231456905500231 15 15 False True 2202 +# 149501466629434994 2 2 False False 210 +# 279570196774736738 2 2 False False 1308 +# ... ... ... ... ... ... 
+``` \ No newline at end of file diff --git a/docs/docs/pypangraph/installation.md b/docs/docs/pypangraph/installation.md index 1a17532e..e164c28d 100644 --- a/docs/docs/pypangraph/installation.md +++ b/docs/docs/pypangraph/installation.md @@ -1,13 +1,17 @@ --- -sidebar_position: 1 +sidebar_position: 2 --- # Installation -PyPanGraph is a Python package to facilitate analysis of pangraph JSON files. It can be installed from [PyPi](https://pypi.org/) or Bioconda as + +PyPanGraph can be installed from [PyPi](https://pypi.org/) or Bioconda as + ``` pip install pypangraph ``` + or + ``` conda install -c bioconda pypangraph ``` diff --git a/docs/docs/pypangraph/tutorial1.md b/docs/docs/pypangraph/t01-load-graph.md similarity index 92% rename from docs/docs/pypangraph/tutorial1.md rename to docs/docs/pypangraph/t01-load-graph.md index ff3d7e73..672f2704 100644 --- a/docs/docs/pypangraph/tutorial1.md +++ b/docs/docs/pypangraph/t01-load-graph.md @@ -1,5 +1,5 @@ --- -sidebar_position: 1 +sidebar_position: 3 --- # Loading and exploring a graph @@ -18,7 +18,7 @@ print(graph) ## The components of a graph -As explained in the [tutorial](../tutorial/tutorial_1.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths. +As explained in the [tutorial](../tutorial/t01-building-pangraph.md#what-is-a-pangraph), a pangenome graph is composed of three main components: nodes, blocks and paths. - **Blocks** encode multiple sequence alignments that group together homologous parts of the input genomes. - **Paths** are representation of the input genomes as a sequences of blocks. More precisely, as sequence of **nodes**. @@ -120,4 +120,4 @@ print(block.alignment.generate_alignment()) # '16194835320646696346': 'ATATATGGTGCGTTAATTTTTAAACCCT...'} ``` -More details on alignments are provided in [tutorial 3](tutorial3.md). +More details on alignments are provided in [tutorial 3](t03-block-alignments.md). 
diff --git a/docs/docs/pypangraph/tutorial2.md b/docs/docs/pypangraph/t02-pangenome.md similarity index 99% rename from docs/docs/pypangraph/tutorial2.md rename to docs/docs/pypangraph/t02-pangenome.md index 56393757..52469c48 100644 --- a/docs/docs/pypangraph/tutorial2.md +++ b/docs/docs/pypangraph/t02-pangenome.md @@ -1,5 +1,5 @@ --- -sidebar_position: 2 +sidebar_position: 4 --- # A look at the pangenome diff --git a/docs/docs/pypangraph/tutorial3.md b/docs/docs/pypangraph/t03-block-alignments.md similarity index 95% rename from docs/docs/pypangraph/tutorial3.md rename to docs/docs/pypangraph/t03-block-alignments.md index 0265694c..7d9f41ea 100644 --- a/docs/docs/pypangraph/tutorial3.md +++ b/docs/docs/pypangraph/t03-block-alignments.md @@ -1,5 +1,5 @@ --- -sidebar_position: 3 +sidebar_position: 5 --- # Exploring block alignments @@ -44,7 +44,7 @@ AlignIO.write(aln, "aln.fa", "fasta") :::info block alignment vs block sequences - As explained in [Pangraph tutorial](../tutorial/tutorial_3.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph. + As explained in [Pangraph tutorial](../tutorial/t03-exporting-sequences.md), insertions are not exported in alignments since they are not aligned to the consensus sequence of the block by pangraph. If these insertions are important for your analysis, you can instead export **unaligned but complete** block sequences as biopython SeqRecord objects with: @@ -72,7 +72,7 @@ AlignIO.write(aln, "aln.fa", "fasta") Other than the alignment for single blocks, we can also extract the alignment of the full core genome, i.e. the concatenated alignment of all single-copy core blocks. 
Pangraph has a [dedicated export subcommand](../reference.md#pangraph-export-core-genome) for this: ```bash -pangraph export core-genome --guide-strain RCS34_p1 plasmids.json > core_aln.fa +pangraph export core-genome --guide-strain RCS34_p1 plasmids.json -o core_aln.fa ``` Alternatively pypangraph provides the following method: diff --git a/docs/docs/pypangraph/tutorial4.md b/docs/docs/pypangraph/t04-core-synteny.md similarity index 95% rename from docs/docs/pypangraph/tutorial4.md rename to docs/docs/pypangraph/t04-core-synteny.md index 45a87cbd..5521f587 100644 --- a/docs/docs/pypangraph/tutorial4.md +++ b/docs/docs/pypangraph/t04-core-synteny.md @@ -1,5 +1,5 @@ --- -sidebar_position: 4 +sidebar_position: 6 --- # Paths and core genome synteny @@ -68,7 +68,7 @@ plt.show() From this plot we observe a strong conservation in the order of core blocks. This is even more explicit if we look at the graph in [Bandage](https://rrwick.github.io/Bandage/). We can use the export function of pangraph to export the graph in GFA format. By adding the `--no-duplicated` flag and the `--minimum-depth 15` option we can make sure that only core blocks are exported. ```bash -pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json > plasmids_core.gfa +pangraph export gfa --no-duplicated --minimum-depth 15 plasmids.json -o plasmids_core.gfa ``` Moreover we can save the block colors that we used in the previous plot in a csv file, that can be loaded by Bandage to color the blocks. @@ -92,7 +92,7 @@ For these cases, pypangraph provides a method to quickly survey all changes in c ![minimal synteny units](../assets/pp_t4_minimal_synteny_units.png) -For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/tutorial_1.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. 
The minimal sinteny units for this graph can be extracted with the function: +For this part of the tutorial we will analyze the `graph.json` file created [in the first tutorial](../tutorial/t01-building-pangraph.md#building-the-pangraph), containing 10 _E. coli_ chromosomes. The minimal synteny units for this graph can be extracted with the function: ```python graph = pp.Pangraph.from_json("graph.json") @@ -142,7 +142,8 @@ Similarly to what done for plasmids, we can visualize these units on Bandage. We pangraph export gfa \ --no-duplicated \ --minimum-depth 10 \ - graph.json > ecoli.gfa + -o ecoli.gfa \ + graph.json ``` And then we can export the dictionary of core-block colors with: diff --git a/docs/docs/pypangraph/tutorial5.md b/docs/docs/pypangraph/t05-dotplot.md similarity index 98% rename from docs/docs/pypangraph/tutorial5.md rename to docs/docs/pypangraph/t05-dotplot.md index 5bd27925..b8a058e0 100644 --- a/docs/docs/pypangraph/tutorial5.md +++ b/docs/docs/pypangraph/t05-dotplot.md @@ -1,12 +1,12 @@ --- -sidebar_position: 5 +sidebar_position: 7 --- # Comparing two genomes in a dotplot Decomposing genomes in separate blocks provides a very good starting point for pairwise comparison between genomes. The pangenome graph can easily be used to draw a dotplot between two different paths, in which lines represent shared blocks. -In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella Pneumoniae_ in [a previous tutorial](../tutorial/tutorial_4.md). +In this example we consider the `klebs_pangraph.json` graph generated from 9 complete chromosomes of _Klebsiella pneumoniae_ in [a previous tutorial](../tutorial/t04-graph-projection.md). 
PyPangraph provides a convenient `dotplot` function to generate such a dotplot: diff --git a/docs/docs/tutorial/tutorial_1.md b/docs/docs/tutorial/t01-building-pangraph.md similarity index 95% rename from docs/docs/tutorial/tutorial_1.md rename to docs/docs/tutorial/t01-building-pangraph.md index 8be1920d..97784291 100644 --- a/docs/docs/tutorial/tutorial_1.md +++ b/docs/docs/tutorial/t01-building-pangraph.md @@ -26,10 +26,10 @@ More specifically, a path is encoded as a list of oriented block occurrences, i. The tutorial requires you to have the `pangraph` command available in your path. Instructions on how to install pangraph can be found in [Installation](../category/installation). -For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia Coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded with the command: +For this tutorial we will use a small dataset containing full chromosomes of 10 _Escherichia coli_ strains (source: GenBank). For convenience this dataset is available in the pangraph repository (`data/ecoli.fa.gz`), and can be downloaded [from this link](https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz) or by running: ```bash -wget https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz +curl -L https://github.com/neherlab/pangraph/raw/master/data/ecoli.fa.gz -o ecoli.fa.gz ``` This is a single fasta file containing 10 fully assembled bacterial chromosomes, but no plasmids. 
@@ -41,7 +41,7 @@ As a first step, we will build a pangraph object from the DNA of the 10 chromoso This can be done using the command `build` (see [`build` command](../reference#pangraph-build)): ```bash -pangraph build -j 4 --circular ecoli.fa.gz > graph.json +pangraph build -j 4 --circular ecoli.fa.gz -o graph.json ``` - the option `--circular` is used when passing circular DNA sequences, like the bacterial chromosomes that we consider here. - the option `-j 4` specifies the number of threads to use. @@ -113,7 +113,7 @@ nodes = { } ``` -More details on the structure of this `json` file will be covered in the [next tutorial section](tutorial_2.md). +More details on the structure of this `json` file will be covered in the [next tutorial section](t02-pangraph-output-file.md). ### Sequence diversity and alignment sensitivity @@ -144,7 +144,7 @@ As a first example, we consider exporting the pangraph in [Graphical Fragment As pangraph export gfa \ --no-duplicated \ graph.json \ - > graph.gfa + -o graph.gfa ``` This will create a `graph.gfa` file, which can be visualized using [Bandage](https://rrwick.github.io/Bandage/). @@ -161,9 +161,9 @@ pangraph export gfa \ --minimum-depth=10 \ --include-sequences \ graph.json \ - > graph_core.gfa + -o graph_core.gfa ``` -The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/tutorial4.md). +The resulting graph is much simpler. The remaining crossings are due to changes in core-genome synteny. Each change in order of core blocks results in a crossing in the graph, as will be discussed in [a later tutorial section](../pypangraph/t04-core-synteny.md). 
![img](./../assets/t1_gfa_core.png) \ No newline at end of file diff --git a/docs/docs/tutorial/tutorial_2.md b/docs/docs/tutorial/t02-pangraph-output-file.md similarity index 97% rename from docs/docs/tutorial/tutorial_2.md rename to docs/docs/tutorial/t02-pangraph-output-file.md index e130b5b1..1034a628 100644 --- a/docs/docs/tutorial/tutorial_2.md +++ b/docs/docs/tutorial/t02-pangraph-output-file.md @@ -10,7 +10,7 @@ As an example, we will use snippets from the `graph.json` file that was produced ## The structure of `graph.json` -As discussed in the [previous tutorial section](./tutorial_1.md#what-is-a-pangraph), the three main entries of pangraph output file are `paths`, `blocks` and `nodes`. +As discussed in the [previous tutorial section](./t01-building-pangraph.md#what-is-a-pangraph), the three main entries of pangraph output file are `paths`, `blocks` and `nodes`. - each entry in the `paths` list encodes one of the nucleotide sequences that were given as input to the `build` command, represented as a list of nodes (i.e. particular instances of a block) - each entry in the `blocks` list represents an alignable set of homologous sequences. A block contains the consensus of all of these sequences, together with information to reconstruct the full alignment. Each entry in the alignment is represented by a `node`. @@ -141,7 +141,7 @@ Below is a schematic representation of how these variations are applied to the c ![img](./../assets/t2_alignment_reconstruction.png) -As discussed in the [next section](./tutorial_3.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways: +As discussed in the [next section](./t03-exporting-sequences.md), using information in the `alignments` dictionary the different sequences of a block can be reconstructed in two ways: - as **node sequences**. 
In this case sequences are not aligned, but each entry corresponds to the exact sequence of a node, with all variations applied. - as a **multiple sequence alignment**. In this case sequences are aligned, but insertions are omitted. diff --git a/docs/docs/tutorial/tutorial_3.md b/docs/docs/tutorial/t03-exporting-sequences.md similarity index 94% rename from docs/docs/tutorial/tutorial_3.md rename to docs/docs/tutorial/t03-exporting-sequences.md index 8c474142..ba320b74 100644 --- a/docs/docs/tutorial/tutorial_3.md +++ b/docs/docs/tutorial/t03-exporting-sequences.md @@ -16,7 +16,7 @@ Block consensus sequences can be exported using the [`export block-consensus` su ```bash pangraph export block-consensus \ graph.json \ - > block_cons.fa + -o block_cons.fa ``` This generates the `block_cons.fa` FASTA file. This file contains one entry per block, with the block ID as the header and the consensus sequence as the sequence: @@ -60,7 +60,7 @@ ATTCATGTCCTTGACTGCTTTGTTAATGTCGCACTGGA... The FASTA id of each entry is the node id, while the description contains a json string with additional information: the path name, block id, start and end positions of the node, and strandedness. -Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./tutorial_2#how-alignments-are-encoded)). However pangraph also provides the option to export complete, _but unaligned_, sequences for each block: +Note that while these alignments contain deletions, they _do not include insertions_. This is due to the fact that alignments are relative to the block consensus, against which insertions cannot be placed (see [the previous tutorial section](./t02-pangraph-output-file.md#how-alignments-are-encoded)). 
However pangraph also provides the option to export complete, _but unaligned_, sequences for each block: ```bash pangraph export block-sequences \ @@ -80,7 +80,7 @@ Pangraph also provides a quick command to extract the core-genome alignment of t pangraph export core-genome \ graph.json \ --guide-strain NC_010468 \ - > core_genome_aln.fa + -o core_genome_aln.fa ``` :::note diff --git a/docs/docs/tutorial/tutorial_4.md b/docs/docs/tutorial/t04-graph-projection.md similarity index 93% rename from docs/docs/tutorial/tutorial_4.md rename to docs/docs/tutorial/t04-graph-projection.md index b38f7422..0092b272 100644 --- a/docs/docs/tutorial/tutorial_4.md +++ b/docs/docs/tutorial/t04-graph-projection.md @@ -13,13 +13,13 @@ In this next part of the tutorial we show how to use the `simplify` command. Thi We will run this tutorial on a different dataset, containing 9 complete chromosomes of _Klebsiella Pneumoniae_ (source: GenBank). These sequences are available in the pangraph repository (`example_dataset/klebs.fa.gz`) and can be downloaded by running: ```bash -wget https://github.com/neherlab/pangraph/raw/master/example_datasets/klebs.fa.gz +curl -L https://github.com/neherlab/pangraph/raw/master/example_datasets/klebs.fa.gz -o klebs.fa.gz ``` As for the previous dataset, we can create the pangraph with the command: ```bash -pangraph build --circular -j 4 klebs.fa.gz > klebs_pangraph.json +pangraph build --circular -j 4 klebs.fa.gz -o klebs_pangraph.json ``` On 4 cores the command should complete in around 4 mins using 4 threads. After creating the pangraph, we can export it in `gfa` format for visualization. @@ -28,7 +28,7 @@ On 4 cores the command should complete in around 4 mins using 4 threads. After c pangraph export gfa \ --no-duplicated \ klebs_pangraph.json \ - > klebs_pangraph.gfa + -o klebs_pangraph.gfa ``` The output file can be visualized using [Bandage](https://rrwick.github.io/Bandage/). 
@@ -50,7 +50,7 @@ For this example we will consider the pair of strains `NZ_CP013711` and `NC_0175 pangraph simplify \ klebs_pangraph.json \ --strains='NZ_CP013711,NC_017540' \ - > klebs_marginal_pangraph.json + -o klebs_marginal_pangraph.json ``` The file `klebs_marginal_pangraph.json` will contain the new marginalized pangraph. The strains on which one projects are specified with the flag `--strains`. They must be passed as a comma separated list of sequence ids, without spaces. @@ -80,12 +80,12 @@ pangraph export gfa \ --no-duplications \ --minimum-length 150 \ klebs_marginal_pangraph.json \ - > klebs_marginal_pangraph.gfa + -o klebs_marginal_pangraph.gfa ``` ![img](../assets/t4_klebs_marginal_pangraph.png) -As expected the marginalized pangraph contains fewer blocks than the original one (388 vs 1244), and blocks are on average longer (mean length: 14 kbp vs 6 kbp). Blocks that appear in red are shared by both strains, while black blocks are present in only one of the two strains. The pangraph is composed of two stretches of syntenic blocks, which are in contact in a central point. This structure can be understood by comparing the two chromosomes with a dotplot (see [dotplots with pypangraph](../pypangraph/tutorial5.md)) +As expected the marginalized pangraph contains fewer blocks than the original one (388 vs 1244), and blocks are on average longer (mean length: 14 kbp vs 6 kbp). Blocks that appear in red are shared by both strains, while black blocks are present in only one of the two strains. The pangraph is composed of two stretches of syntenic blocks, which are in contact in a central point. 
This structure can be understood by comparing the two chromosomes with a dotplot (see [dotplots with pypangraph](../pypangraph/t05-dotplot.md)) ![img](../assets/t4_klebs_dotplot.png) diff --git a/docs/docs/tutorial/tutorial_5.md b/docs/docs/tutorial/t05-example-plasmid-rearrangements.md similarity index 84% rename from docs/docs/tutorial/tutorial_5.md rename to docs/docs/tutorial/t05-example-plasmid-rearrangements.md index 10e9a64e..9cf37745 100644 --- a/docs/docs/tutorial/tutorial_5.md +++ b/docs/docs/tutorial/t05-example-plasmid-rearrangements.md @@ -12,16 +12,16 @@ Although pangraph was developed with whole genomes in mind, it can be applied to This tutorial uses a dataset of five closely-related plasmids. They were analysed previously by Sheppard et al. (2016) in a [paper](https://doi.org/10.1128/AAC.00464-16) studying an outbreak of carbapenem-resistant bacteria in a hospital in Virginia, USA. These plasmids are all similar to an [index plasmid](https://www.ncbi.nlm.nih.gov/nuccore/CP017937.1) from the hospital, but have some structural changes. We will show how pangraph output can be used to visualize this structural diversity. 
-You can download these sequences by running: +You can download these sequences [from this link](https://github.com/liampshaw/pangraph-tutorials/raw/main/data/sheppard/UVA01_plasmids.fa.gz) or by running: ```bash -wget https://github.com/liampshaw/pangraph-tutorials/raw/main/data/sheppard/UVA01_plasmids.fa.gz +curl -L https://github.com/liampshaw/pangraph-tutorials/raw/main/data/sheppard/UVA01_plasmids.fa.gz -o UVA01_plasmids.fa.gz ``` Building the pangraph and exporting it for visualization is done with these commands (should be very quick as we are using plasmids, which are much smaller than whole genomes): ```bash -pangraph build --circular UVA01_plasmids.fa.gz > UVA01_plasmids_pangraph.json +pangraph build --circular UVA01_plasmids.fa.gz -o UVA01_plasmids_pangraph.json pangraph export gfa -o UVA01_plasmids_pangraph.gfa --minimum-length 0 UVA01_plasmids_pangraph.json ``` @@ -40,10 +40,14 @@ Here, the node colour represents the depth of the blocks. However, it is difficu We can use some custom scripts to look at representations of the plasmids alongside their pangraph. These scripts are not part of pangraph but are an example of how to process the output into visualizations. 
You can download them by running: ```bash -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks-UVA01-rust.R +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py -o prepare-pangraph-gfa-rust.py +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks-UVA01-rust.R -o plot-blocks-UVA01-rust.R ``` +Alternatively, you can download the scripts from the following links: +- [prepare-pangraph-gfa-rust.py](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py) +- [plot-blocks-UVA01-rust.R](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks-UVA01-rust.R) + First, we run a script to generate random colours for the blocks. ```bash diff --git a/docs/docs/tutorial/tutorial_6.md b/docs/docs/tutorial/t06-example-resistance-gene-neighbourhood.md similarity index 76% rename from docs/docs/tutorial/tutorial_6.md rename to docs/docs/tutorial/t06-example-resistance-gene-neighbourhood.md index 39234758..9b2b0319 100644 --- a/docs/docs/tutorial/tutorial_6.md +++ b/docs/docs/tutorial/t06-example-resistance-gene-neighbourhood.md @@ -14,10 +14,10 @@ Theoretically you could run pangraph on the full genomes of the isolates. Howeve [^1]: David et al. actually long-read sequenced n=44 isolates. For simplicity, we have only kept contigs that were sufficiently long, hence 34. 
-You can download this dataset by running the following: +You can download this dataset [from this link](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc-contigs-u10k-d5k.fa) or by running the following: ```bash -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc-contigs-u10k-d5k.fa +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc-contigs-u10k-d5k.fa -o kpc-contigs-u10k-d5k.fa ``` ## Running pangraph @@ -25,7 +25,7 @@ wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kp We can now run pangraph on these extracted regions (n=34). This should take only a few seconds to run. ```bash -pangraph build kpc-contigs-u10k-d5k.fa > pangraph_kpc_u10k_d5k.json +pangraph build kpc-contigs-u10k-d5k.fa -o pangraph_kpc_u10k_d5k.json pangraph export gfa --minimum-length 0 -o pangraph_kpc_u10k_d5k.gfa pangraph_kpc_u10k_d5k.json pangraph export block-consensus -o pangraph_kpc_u10k_d5k.fa pangraph_kpc_u10k_d5k.json @@ -37,10 +37,10 @@ These commands give us three forms of pangraph output: * `pangraph_kpc_u10k_d5k.gfa` - graph in [GFA](http://gfa-spec.github.io/GFA-spec/GFA1.html) format * `pangraph_kpc_u10k_d5k.fa` - multifasta containing the consensus sequences of the pangenome blocks -We know by construction that the KPC gene should be in all the contigs, so should be in the same alignment block in all sequences. If we download the KPC gene, we can then use `blast` to find this block from the fasta file with the consensus sequences of the blocks. (You will need to install [blast](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html) if you don't yet have it.) +We know by construction that the KPC gene should be in all the contigs, so should be in the same alignment block in all sequences. 
If we download the KPC gene (with the command below or using [this link](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc2.fa)), we can then use `blast` to find this block from the fasta file with the consensus sequences of the blocks. (You will need to install [blast](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html) if you don't yet have it.) ```bash -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc2.fa +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/data/kpc/kpc2.fa -o kpc2.fa makeblastdb -in pangraph_kpc_u10k_d5k.fa -dbtype 'nucl' geneBlock=$(blastn -query kpc2.fa -db pangraph_kpc_u10k_d5k.fa -outfmt 6 | cut -f 2) echo $geneBlock @@ -54,10 +54,13 @@ Similar to the previous tutorial, we then convert the gfa into a csv that stores ```bash # Download custom script -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py -o prepare-pangraph-gfa-rust.py python prepare-pangraph-gfa-rust.py pangraph_kpc_u10k_d5k.gfa ``` +Alternatively, you can download the script from the following link: +- [prepare-pangraph-gfa-rust.py](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa-rust.py) + This makes three output files: * `${input}.blocks.csv` - dataset of genome and block start/end positions @@ -85,16 +88,19 @@ From this visualization, we can see that most of these KPC-positive have a very ## Linear visualization -The graph visualization with Bandage is helpful, but it can also be useful to view the unique paths through the graph as a linear visualization. In this section we will use an R script to do this, as in the previous tutorial. 
+The graph visualization with Bandage is helpful, but it can also be useful to view the unique paths through the graph as a linear visualization. In this section we will use an R script to do this, as in the previous tutorial. ```bash -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks.R +curl https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks.R -o plot-blocks.R Rscript plot-blocks.R \ pangraph_kpc_u10k_d5k.gfa.blocks.csv \ $geneBlock pangraph_kpc_u10k_d5k.gfa.png \ pangraph_kpc_plot.pdf ``` +Alternatively, you can download the script from the following link: +- [plot-blocks.R](https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks.R) + ![img](../assets/linear_and_graph_kpc.png) If you pick a genome on the left of the plot, you should be able to follow its path through the graph representation on the right using the colours.[^4] The block starting at position 0 is the KPC-block. diff --git a/packages/pypangraph/README.md b/packages/pypangraph/README.md index ad76d887..e9aca5ae 100644 --- a/packages/pypangraph/README.md +++ b/packages/pypangraph/README.md @@ -2,13 +2,12 @@ This repository contains a collection of utilities to load, explore and analyze pangrenome graphs produced by [PanGraph](https://github.com/neherlab/pangraph). -The package can be installed via pip: - +The package can be installed via pip or conda, see [the documentation](https://docs.pangraph.org/pypangraph/installation): ```bash pip install pypangraph ``` -Below are some examples showcasing some of the main functions in the package. More detailed information can be found in the documentation. +Below are some examples showcasing some of the main functions in the package. More detailed information and examples can be found in the [documentation](https://docs.pangraph.org/category/pypangraph). ## Loading and interacting with pangraph objects: