input data preparation

The input folder must have the following structure:

input_fld/
├── reads
│   ├── sample_1.fastq.gz
│   ├── sample_2.fastq.gz
│   └── ...
└── references
    ├── ref_1.fa
    ├── ref_1.gbk (optional)
    ├── ref_2.fa
    ├── ref_2.gbk (optional)
    └── ...

The names of the read and reference files are arbitrary, but their extensions must be .fastq.gz and .fa, respectively.
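
As a quick sanity check of this layout, the sketch below lists the sample and reference names found in an input folder. It is not part of the pipeline: the function name list_inputs is hypothetical, and it assumes that a file's name minus its extension is what identifies a sample or reference.

```python
from pathlib import Path


def list_inputs(input_fld):
    """Return (samples, references) found in the expected sub-folders.

    Hypothetical helper: names are taken to be the file names without
    their .fastq.gz / .fa extension (an assumption, not pipeline code).
    """
    root = Path(input_fld)
    # Strip the full ".fastq.gz" suffix to get the sample name.
    samples = sorted(
        p.name[: -len(".fastq.gz")] for p in (root / "reads").glob("*.fastq.gz")
    )
    # Only .fa files count as references; .gbk files are ignored here.
    references = sorted(p.stem for p in (root / "references").glob("*.fa"))
    return samples, references
```

Files with any other extension are simply ignored by this check, so an empty result usually points to a misnamed folder or extension.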

Record names in the reference files must be short (fewer than 20 characters) and must not contain spaces, since they are used to generate the names of subfiles.
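
The naming rule can be checked before running the pipeline with a few lines of Python. This is a minimal sketch under one stated assumption: the whole header line after ">" is treated as the record name, which is the strictest reading of the rule. The function name bad_record_names is hypothetical.

```python
def bad_record_names(fasta_path, max_len=20):
    """Return FASTA record names that break the naming rule.

    Assumption: the entire header line after '>' is the record name.
    A name is rejected if it has max_len or more characters, or if it
    contains any whitespace.
    """
    bad = []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                name = line[1:].rstrip("\r\n")
                if len(name) >= max_len or any(c.isspace() for c in name):
                    bad.append(name)
    return bad
```

An empty list means every record in the file satisfies both constraints.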

For the part of the pipeline that extracts annotations for relevant mutated positions, a GenBank file must be present among the references. It must have the same prefix as the corresponding reference file (e.g. ref_1.gbk for ref_1.fa), and its records must have the same names as in the multi-FASTA file.
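
To see which references lack a matching GenBank file, a check along these lines may help. It is only a sketch: the function name references_without_genbank is hypothetical, and it tests file pairing by prefix only, not whether record names agree between the two files.

```python
from pathlib import Path


def references_without_genbank(input_fld):
    """Return names of .fa references that have no same-prefix .gbk file."""
    ref_dir = Path(input_fld) / "references"
    return sorted(
        fa.stem
        for fa in ref_dir.glob("*.fa")
        # ref_1.fa is expected to be paired with ref_1.gbk
        if not fa.with_suffix(".gbk").exists()
    )
```

References returned by this function can still be used, since the GenBank file is optional, but no annotations would be available for them.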

configuration file

The run_config entry in the config file tells the pipeline where the input data are located, where the output data should be saved, and which reference should be used for each sample. It must have the following structure:

run_config:
    input: "input_fld"
    output: "output_fld"
    pileups:
        ref_1:
        - "sample_1"
        - "sample_2"
        - ...
        ref_2:
        - "sample_1"
        - "sample_2"
        - ...
...
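
Before launching a run, it can be useful to confirm that every reference and sample named under pileups exists in the input folder. The sketch below works on an already parsed run_config mapping, for instance one loaded with a YAML parser. The function name check_run_config is hypothetical, and it assumes that the names in the config are the file names without their .fa or .fastq.gz extension.

```python
from pathlib import Path


def check_run_config(run_config):
    """Return a list of problems found in a parsed run_config mapping.

    Assumption: reference and sample names in the config match the
    file names in the input folder, minus the extension.
    """
    root = Path(run_config["input"])
    problems = []
    for ref, samples in run_config["pileups"].items():
        if not (root / "references" / f"{ref}.fa").is_file():
            problems.append(f"missing reference: {ref}.fa")
        for sample in samples:
            if not (root / "reads" / f"{sample}.fastq.gz").is_file():
                problems.append(f"missing reads for {ref}: {sample}.fastq.gz")
    return problems
```

An empty list means all files referenced by the config were found.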

The config file is then passed as input to the pipeline using the --configfile flag:

snakemake --configfile myconfig.yml

The same config file also contains other options that control various parameters of the pipeline. These are described in the plot description file.