diff --git a/docs/usage.md b/docs/usage.md index 8377628d..6ba58501 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,7 +6,7 @@ ## Samplesheet input -You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 4 columns, and a header row as shown in the examples below. +Before running the pipeline, create a samplesheet with information about the samples you would like to analyse. This samplesheet contains the files that will be passed as inputs to the pipeline. The --input parameter is used to specify the samplesheet location. It has to be a comma-separated file with 4 columns, and a header row as shown in the examples below. ```bash --input '[path to samplesheet file]' @@ -14,49 +14,56 @@ You will need to create a samplesheet with information about the samples you wou ### Multiple replicates -The `sample` identifier is the same when you have multiple biological replicates from the same experimental group, just increment the `replicate` identifier appropriately. The first replicate value for any given experimental group must be 1. Below is an example for a single experimental group in triplicate: +The `sample` identifier is the same when you have multiple biological replicates from the same experimental group, just increment the `replicate` identifier appropriately. The first replicate value for any given experimental group must be 1. Below is an example for the analysis of paired-end sequencing of ATAC-seq experiment performed in triplicate for the cell line "A" : ```console sample,fastq_1,fastq_2,replicate -CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1 -CONTROL,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,2 -CONTROL,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,3 +SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1 +SAMPLE_A,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,2 +SAMPLE_A,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,3 ``` -The pipeline will automatically append the `&_REP` suffix to the sample name within the pipeline e.g. `CONTROL_REP1`, `CONTROL_REP2` and `CONTROL_REP3` using the example above. If you don't have replicates you can set the `replicate` value to 1 for all of your samples. +The pipeline will automatically append the `&_REP` suffix to the sample name within the pipeline e.g. `SAMPLE_A_REP1`, `SAMPLE_A_REP2` and `SAMPLE_A_REP3` using the example above. If you don't have replicates you can set the `replicate` value to 1 for all of your samples. Below an example for the analysis of of paired-end sequencing of ATAC-seq experiment performed without replicates for the cell line "A" and for the tissue "B": + +```console +sample,fastq_1,fastq_2,replicate +SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1 +SAMPLE_B,BEG599B2_S1_L003_R1_001.fastq.gz,BEG599B2_S1_L003_R2_001.fastq.gz,1 +``` ### Multiple runs of the same sample -The `sample` and `replicate` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will perform the alignments in parallel, and subsequently merge them before further analysis. Below is an example a sample sequenced across 3 lanes: +The `sample` and `replicate` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will perform the alignments in parallel, and subsequently merge them before further analysis. Below is an example of how the samplesheet for SAMPLE_A would look like if sequenced across 3 lanes: ```csv title="samplesheet.csv" sample,fastq_1,fastq_2,replicate -CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1 -CONTROL,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,1 -CONTROL,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,1 +SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1 +SAMPLE_A,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,1 +SAMPLE_A,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,1 ``` -The pipeline will automatically append the `*_T` suffix to the sample name within the pipeline e.g. `CONTROL_REP1_T1`, `CONTROL_REP1_T2` and `CONTROL_REP1_T3` using the example above. +The pipeline will automatically append the `*_T` suffix to the sample name within the pipeline e.g. `SAMPLE_A_REP1_T1`, `SAMPLE_A_REP1_T2` and `SAMPLE_A_REP1_T3` using the example above. -### Control data +### INPUT control data -If controls are to be used for peak calling use the parameter `--with_control`. In this case, the samplesheet file needs the additional columns `control` and `control_replicate`. These should be the sample identifier and sample replicate for the controls. +An input control is a file that can be used during peak calling to estimate the background of the experiment. If input controls sequencing information is available, it can be used for peak calling using the parameter `--with_control`. In this case, the samplesheet file needs the additional columns `control` and `control_replicate`. These should be the sample identifier and sample replicate for the input controls, as in the example below. ### Full samplesheet The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 4 columns to match those defined in the table below. -A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 7 samples, where we have biological triplicates for both the `CONTROL` and `TREATMENT` groups, and the third replicate in the `TREATMENT` group has been a technical replicate as a result of being sequenced twice. +A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 7 samples, where we have biological triplicates for both the control condition, such as cell line "A" untreated, `UNTREATED_A` and the treatment condition, such as cell line "A" treated with a compound, `TREATED_A` groups, and the third replicate in the `TREATED_A` group has been a technical replicate as a result of being sequenced twice. In this example, INPUT control is available `INPUT_A` with no replicates. ```csv title="samplesheet.csv" sample,fastq_1,fastq_2,replicate,control,control_replicate -CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1,, -CONTROL,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,2,, -CONTROL,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,3,, -TREATMENT,AEG588A4_S4_L003_R1_001.fastq.gz,,1,CONTROL,1 -TREATMENT,AEG588A5_S5_L003_R1_001.fastq.gz,,2,CONTROL,2 -TREATMENT,AEG588A6_S6_L003_R1_001.fastq.gz,,3,CONTROL,3 -TREATMENT,AEG588A6_S6_L004_R1_001.fastq.gz,,3,CONTROL,3 +INPUT_A,IEG577I1_S1_L001_R1_001.fastq.gz,IEG577I1_S1_L002_R2_001.fastq.gz,1,, +UNTREATED_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1,INPUT_A,1 +UNTREATED_A,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,2,INPUT_A,1 +UNTREATED_A,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,3,INPUT_A,1 +TREATED_A,AEG588A4_S4_L003_R1_001.fastq.gz,,1,INPUT_A,1 +TREATED_A,AEG588A5_S5_L003_R1_001.fastq.gz,,2,INPUT_A,1 +TREATED_A,AEG588A6_S6_L003_R1_001.fastq.gz,,3,INPUT_A,1 +TREATED_A,AEG588A6_S6_L004_R1_001.fastq.gz,,3,INPUT_A,1 ``` | Column | Description |