Error - "Duplicated chromosome entries detected in samplesheet. Check your samplesheet." #338

Open
gurpreet-bioinfo opened this issue Jul 12, 2024 · 3 comments
Labels
user-query User queries & requests

Comments

@gurpreet-bioinfo

gurpreet-bioinfo commented Jul 12, 2024

Description of the bug

Hi, I have added the paths to multiple VCFs in samplesheet.csv as input to the pipeline and left the chrom column empty, as recommended ("If the target genomic data file contains multiple chromosomes, leave empty.").

However, I keep getting this error: Duplicated chromosome entries detected in samplesheet. Check your samplesheet.

Below is a section from my samplesheet.csv:

sampleset,path_prefix,chrom,format
proj1,/analysis/L12/vcf/L12, ,vcf
proj1,/analysis/L13/vcf/L13, ,vcf
proj1,/analysis/L14/vcf/L14, ,vcf
proj1,/analysis/L15/vcf/L15, ,vcf
  • I also tried to use -r v2.0.0-beta.1, but it fails with: Project pgscatalog/pgsc_calc contains uncommitted changes -- Cannot switch to revision: v2.0.0-beta.1

Thanks.

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity \
    --input samplesheet.csv \
    --target_build GRCh38 \
    --pgs_id PGS001013,PGS001015 \
    --run_ancestry pgsc_HGDP+1kGP_v1.tar.zst \
    --outdir $PWD/results

Relevant files

No response

System information

Nextflow version: 23.10.1
Hardware: HPC
Executor: Slurm
Container Engine: Singularity
OS: Linux
pgsc_calc v2.0.0-beta-gccfd636

@gurpreet-bioinfo gurpreet-bioinfo added the bug Something isn't working label Jul 12, 2024
@nebfield
Member

The calculator works best with cohort data that have been imputed.

If you have one sample per VCF then you should merge your target genomes before using the calculator. Multiple rows in a samplesheet are for target genomes that have been split per chromosome.
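
To illustrate why several rows with an empty chrom column trigger this error, here is a minimal Python sketch of a duplicate check. It is not the pipeline's actual implementation: the function name `find_duplicate_chroms` is hypothetical, and it assumes that duplicates are counted per sampleset and that a whitespace-only chrom is treated the same as an empty one.

```python
import csv
import io
from collections import Counter


def find_duplicate_chroms(samplesheet_text):
    """Return (sampleset, chrom) pairs that appear on more than one row.

    Assumption: an empty or whitespace-only chrom means "all chromosomes in
    one file" and is normalised to None, so two such rows in the same
    sampleset count as duplicates.
    """
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    counts = Counter()
    for row in reader:
        chrom = (row["chrom"] or "").strip() or None
        counts[(row["sampleset"], chrom)] += 1
    return [key for key, n in counts.items() if n > 1]


# Rows from the samplesheet in this issue: four files, all with an empty chrom.
per_sample = """sampleset,path_prefix,chrom,format
proj1,/analysis/L12/vcf/L12, ,vcf
proj1,/analysis/L13/vcf/L13, ,vcf
proj1,/analysis/L14/vcf/L14, ,vcf
proj1,/analysis/L15/vcf/L15, ,vcf
"""

# Hypothetical per-chromosome split of one cohort: every chrom is distinct.
per_chrom = """sampleset,path_prefix,chrom,format
cohort,/data/cohort_chr1,1,vcf
cohort,/data/cohort_chr2,2,vcf
"""

print(find_duplicate_chroms(per_sample))  # [('proj1', None)]
print(find_duplicate_chroms(per_chrom))   # []
```

Under these assumptions, the four per-sample rows collapse to the same (sampleset, chrom) key, while a per-chromosome split or a single merged file does not.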

@nebfield nebfield added user-query User queries & requests and removed bug Something isn't working labels Jul 15, 2024
@gurpreet-bioinfo
Author

gurpreet-bioinfo commented Jul 15, 2024

Thanks @nebfield! As per https://pgsc-calc.readthedocs.io/en/latest/how-to/prepare.html#vcf-from-wgs, does that mean I need to use plink2 to convert all my VCF files (each corresponding to WGS data from one patient)? That would be additional work that I did not expect from reading the documentation.
In that case, what should my samplesheet.csv look like?
I am sorry, but these points are not clear from the documentation.

@nebfield
Member

WGS data can cause variant matching problems with the current version of the calculator. The calculator works best with genotyping array data that have been imputed to increase variant density.

Some users have been able to create compatible VCFs from WGS data, but this requires some manual work to 1) create gVCFs from BAM files, 2) merge gVCFs to create a multi-sample gVCF, and 3) include nonvariant sites in the gVCF.

If you're able to create a multi-sample VCF from the WGS data your samplesheet would look like:

sampleset,path_prefix,chrom,format
merged,path/to/merged,,vcf
