
griffin_merge_sites.py hangs and can't finish #17

Open
robinycfang opened this issue Apr 27, 2024 · 5 comments

@robinycfang

Hi

At the last step of Griffin, I used griffin_merge_sites.py to merge TFBSs for each TF. When I used the top 1k sites, everything worked fine, but when I switched to the top 10k TFBSs, the script seemed to get stuck at the last gene and couldn't finish running. I guess the issue is the multiprocessing? Is there any solution for that?

Thanks!

@adoebley
Owner

Hi, thanks for the question and sorry for the delay in responding. From your description, I'm not sure what is going on. I'm hoping you figured out a solution, but if not, could you post more details? (What did the log files show? Did you re-run calc_cov for all 10,000 TFBSs, or just rerun the merge_sites step?)

@Ugreek95

Ugreek95 commented Sep 5, 2024

Hi, I'm commenting on this thread because I think I might have a related issue:

I followed the tutorial and was able to complete the analysis using the demo file provided without any problems. However, I am experiencing difficulties when applying the same pipeline to my .bam files. Specifically, the program crashes and generates the following error log when launching the Nucleosome Profiling step:

snakemake -s griffin_nucleosome_profiling.snakefile --cores 1

Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 calc_cov
1 generate_plots
1 merge_sites
4

[Thu Aug 29 12:48:32 2024]
rule calc_cov:
input: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/SAMPLES/TRIMMED_ALIGN/ELB04-T00-1121_mrgd_trim.fullpipe.srt.bam, /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/UMBE_ANALYSIS/griffin_GC_and_mappability_correction/results/GC_bias/first_sample.GC_bias.txt
output: tmp/first_sample/tmp_bigWig/first_sample.uncorrected.bw, tmp/first_sample/tmp_bigWig/first_sample.GC_corrected.bw, tmp/first_sample/tmp_pybedtools
jobid: 3
wildcards: samples=first_sample

parameters:
sample_name = "first_sample"
bam_path = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/SAMPLES/TRIMMED_ALIGN/ELB04-T00-1121_mrgd_trim.fullpipe.srt.bam"
GC_bias_path = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/UMBE_ANALYSIS/griffin_GC_and_mappability_correction/results/GC_bias/first_sample.GC_bias.txt"
mappability_bias_path = "none"
tmp_dir = "tmp"
ref_seq_path = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38.fa"
mappability_bw = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/k100.Umap.MultiTrackMappability.bw"
chrom_sizes_path = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38.standard.chrom.sizes"
sites_yaml = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/UMBE_ANALYSIS/griffin_nucleosome_profiling/config/sites.yaml"
griffin_scripts_dir = "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts"
chrom_column = "Chrom"
position_column = "position"
strand_column = "Strand"
chroms = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22']
norm_window = [-5000, 5000]
sz_range = [100, 200]
map_q = 20
number_of_sites = "none"
sort_by = "none"
ascending = "none"
CPU = 8

Skipping mappability correction
CTCF_demo processing all 1000 sites
Total sites (fw/rv/undirected/total): 0/0/1000/1000
Intervals to fetch: 982
Total bp to fetch: 10110116
Max fetch length: 20382 bp
Starting fetch
Done with fetch 0 min 4 sec
Starting export
Done with export 0 min 4 sec

real 0m5.126s
user 0m25.348s
sys 0m1.303s
Removing temporary output file tmp/first_sample/tmp_pybedtools.
[Thu Aug 29 12:48:37 2024]
Finished job 3.
1 of 4 steps (25%) done

[Thu Aug 29 12:48:37 2024]
rule merge_sites:
input: tmp/first_sample/tmp_bigWig/first_sample.uncorrected.bw, tmp/first_sample/tmp_bigWig/first_sample.GC_corrected.bw
output: results/first_sample/first_sample.uncorrected.coverage.tsv, results/first_sample/first_sample.GC_corrected.coverage.tsv, tmp/first_sample/tmp_pybedtools2
jobid: 1
wildcards: samples=first_sample

Skipping mappability correction
Normalization window (rounded down to step): [-4995, 4995]
Saving window (rounded down to step): [-990, 990]
Center window (rounded down to step): [-30, 30]
FFT window (rounded down to step): [-960, 960]
Savgol filter smoothing window: 165
Ascending is: none
Excluding regions: ['/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/encode_unified_GRCh38_exclusion_list.bed', '/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_centromeres.bed', '/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_gaps.bed', '/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_fix_patches.bed', '/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_alternative_haplotypes.bed']
Excluding bins with coverage outliers: True
Excluding bins with zero mappability: True
excluding: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/encode_unified_GRCh38_exclusion_list.bed
excluding: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_centromeres.bed
excluding: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_gaps.bed
excluding: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_fix_patches.bed
excluding: /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_alternative_haplotypes.bed
Analyzing 1 site lists
CTCF_demo processing all 1000 sites
CTCF_demo (fw/rv/undirected/total): 0/0/1000/1000
CTCF_demo uncorrected starting fetch 0 min 0 sec
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/tank/home/umbertog/miniconda3/envs/griffin_UMBE/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/tank/home/umbertog/miniconda3/envs/griffin_UMBE/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts/griffin_merge_sites.py", line 521, in merge_sites
results_dict[key]['coverage'] = fetch_bw_values(results_dict[key]['input_path'],current_sites,site_name,key)
File "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts/griffin_merge_sites.py", line 356, in fetch_bw_values
if len(values)<(norm_window[1]-norm_window[0]):
TypeError: object of type 'numpy.float32' has no len()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts/griffin_merge_sites.py", line 612, in <module>
results = p.map(merge_sites, to_do_list, 1) #Send only one interval to each processor at a time.
File "/tank/home/umbertog/miniconda3/envs/griffin_UMBE/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/tank/home/umbertog/miniconda3/envs/griffin_UMBE/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
TypeError: object of type 'numpy.float32' has no len()

real 0m7.892s
user 0m1.228s
sys 0m0.772s
[Thu Aug 29 12:48:45 2024]
Error in rule merge_sites:
jobid: 1
output: results/first_sample/first_sample.uncorrected.coverage.tsv, results/first_sample/first_sample.GC_corrected.coverage.tsv, tmp/first_sample/tmp_pybedtools2
shell:
time /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts/griffin_merge_sites.py --sample_name first_sample --uncorrected_bw_path tmp/first_sample/tmp_bigWig/first_sample.uncorrected.bw --GC_corrected_bw_path tmp/first_sample/tmp_bigWig/first_sample.GC_corrected.bw --GC_map_corrected_bw_path none --mappability_correction False --tmp_dir tmp --results_dir results --mappability_bw /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/k100.Umap.MultiTrackMappability.bw --chrom_sizes_path /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38.standard.chrom.sizes --sites_yaml /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/UMBE_ANALYSIS/griffin_nucleosome_profiling/config/sites.yaml --griffin_scripts /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/scripts --chrom_column Chrom --position_column position --strand_column Strand --chroms chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 --norm_window -5000 5000 --save_window -1000 1000 --center_window -30 30 --fft_window -960 960 --fft_index 10 --smoothing_length 165 --exclude_paths /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/encode_unified_GRCh38_exclusion_list.bed /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_centromeres.bed /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_gaps.bed /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_fix_patches.bed /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38_alternative_haplotypes.bed --step 15 --CNA_normalization False --individual False --smoothing True

I have checked that the .bam files meet the requirements described in the documentation, but I cannot work out what might be causing the error. I was wondering if you could provide me with any pointers or suggestions to resolve this issue.

The main difference is that my BAMs come from whole-genome Nanopore sequencing data, with an average read length of 165 bp and 8-15 million reads each.

I tried several alignment methods and also followed your pipeline exactly, using:

bwa mem -M -t 16 -p /tank/USB_UMBE/NUCLEOSOME_POSITIONING/Griffin-main/Ref/hg38.fa

I tried aligning trimmed and untrimmed fastqs, removing longer reads that might be present due to the differences between Nanopore and Illumina, and I triple-checked my griffin_GC_and_mappability_correction.snakefile, config.yaml, and samples.yaml files, but I still can't solve the issue.

Thank you in advance for your time

@adoebley
Owner

adoebley commented Sep 10, 2024

Hi @Ugreek95, Thanks for the question and log file. It looks like the bw.values function is fetching a float rather than a list from the coverage bigwig. The problem is coming from this part of the script:

for i in range(len(current_sites)):
    chrom,start,end,strand = current_sites.iloc[i][[chrom_column,'fetch_start','fetch_end',strand_column]]
    values = bw.values(chrom, start, end, numpy=True)
    values = np.nan_to_num(values) #turn nan into zero because bw doesn't store zero
    #if the window extends beyond the end of the chromosome add np.nan to fill the gap
    if len(values)<(norm_window[1]-norm_window[0]):

I'm not sure why this is happening but have a few thoughts on things you could check:

Did you change anything in the config other than swapping out the demo bam for your bam?

What version of pyBigWig are you using? I tested griffin using pyBigWig 0.3.17 so a different version could be causing a problem.

You could also try modifying the code to print out some variables (chrom, start, end, strand, and values) when it throws the error. If you tell me those variables, I can try to figure out why the 'values' isn't the same format as the one generated by the demo file.
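For reference, a defensive wrapper around the fetched values could both log the odd case and let the run continue. This is a sketch, not Griffin's actual code: `NORM_WINDOW` and the helper name are made up for illustration. It coerces a scalar return (the `numpy.float32` case from the traceback above) into a 1-D array and pads short windows with `np.nan`:

```python
import numpy as np

NORM_WINDOW = (-5000, 5000)  # hypothetical; matches the window in the log above

def coerce_window_values(values, norm_window=NORM_WINDOW):
    """Normalize whatever bw.values() returned into a 1-D float array
    of the expected window length (hypothetical debugging helper)."""
    arr = np.atleast_1d(np.asarray(values, dtype=np.float64))
    if arr.shape != np.shape(values):
        # this is the numpy.float32 case from the traceback above
        print(f"warning: scalar value {values!r} coerced to length-1 array")
    arr = np.nan_to_num(arr)  # bigWig files don't store zeros, so NaN means zero coverage
    expected = norm_window[1] - norm_window[0]
    if len(arr) < expected:
        # window ran off the end of the chromosome; pad the tail with NaN
        arr = np.concatenate([arr, np.full(expected - len(arr), np.nan)])
    return arr
```

Printing `chrom`, `start`, `end`, and `repr(values)` in the scalar branch would show exactly which site triggers the bad fetch.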

Thanks,
Anna-Lisa

@robinycfang
Author

Hi,

The log file didn't report any error; it was just stuck at merging sites for the last gene (AHR in my case).

ARID3A max_bin_coverage is 1111.0 midpoints
ARID3A masking sites with > 27.0 midpoints
BATF merge complete 17 min 57 sec
ARID3A - excluding outliers.
ARID3A max_bin_coverage is 1111.0 midpoints
ARID3A masking sites with > 27.0 midpoints
ARID3A - excluding outliers.
ARID3A max_bin_coverage is 1111.0 midpoints
ARID3A masking sites with > 27.0 midpoints
ARID3A uncorrected averaging sites
ARID3A max_bin_coverage is 1111.0 midpoints
ARID3A masking sites with > 27.0 midpoints
ARID3A uncorrected averaging sites
ARID3A uncorrected averaging sites
ARID3A uncorrected smoothing
ARID3A uncorrected correcting for read depth
ARID3A GC_corrected averaging sites
ARID3A GC_corrected smoothing
ARID3A GC_corrected correcting for read depth
ARID3A merge complete 17 min 58 sec
AHR - excluding outliers.
AHR max_bin_coverage is 1181.0 midpoints
AHR masking sites with > 34.0 midpoints
AHR uncorrected averaging sites
AHR uncorrected smoothing
AHR uncorrected correcting for read depth
AHR GC_corrected averaging sites
AHR GC_corrected smoothing
AHR GC_corrected correcting for read depth
AHR merge complete 17 min 58 sec

I ran the individual commands instead of snakemake. The code works fine for 1k TFBSs but not for 10k.

python ${griffin_dir}/scripts/griffin_coverage.py \
  --sample_name ${plasma} \
  --bam ${pbam} \
  --GC_bias ${out}/GC_bias/${plasma}.GC_bias.txt \
  --mappability_bias NA \
  --mappability_correction False \
  --tmp_dir ${tmp} \
  --reference_genome ${fa} \
  --mappability_bw ${map_bw} \
  --chrom_sizes_path ${chrom_size} \
  --sites_yaml ${sites} \
  --griffin_scripts_dir ${griffin_dir}/scripts \
  --chrom_column Chromosome \
  --position_column Position \
  --size_range 100 220 \
  --map_quality 20 \
  --number_of_sites none \
  --ascending none \
  --CPU 30

python ${griffin_dir}/scripts/griffin_merge_sites.py \
  --sample_name ${plasma} \
  --uncorrected_bw_path ${tmp}/${plasma}/tmp_bigWig/${plasma}.uncorrected.bw \
  --GC_corrected_bw_path ${tmp}/${plasma}/tmp_bigWig/${plasma}.GC_corrected.bw \
  --GC_map_corrected_bw_path NA \
  --mappability_correction False \
  --tmp_dir ${tmp} \
  --results_dir ${site_out} \
  --mappability_bw ${map_bw} \
  --chrom_sizes_path ${chrom_size} \
  --sites_yaml ${sites} \
  --griffin_scripts_dir ${griffin_dir}/scripts \
  --chrom_column Chromosome \
  --position_column Position \
  --center_window -30 30 \
  --exclude_paths ${griffin_dir}/Ref/encode_unified_GRCh38_exclusion_list.bed ${griffin_dir}/Ref/hg38_centromeres.bed ${griffin_dir}/Ref/hg38_gaps.bed ${griffin_dir}/Ref/hg38_fix_patches.bed ${griffin_dir}/Ref/hg38_alternative_haplotypes.bed \
  --step 15 \
  --CNA_normalization False \
  --individual False \
  --smoothing True \
  --exclude_outliers True \
  --exclude_zero_mappability True \
  --CPU 30
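For anyone hitting a similar silent hang: a generic way to see which task stalls (a sketch, not Griffin's actual code; `work` stands in for the real `merge_sites` worker) is to swap the blocking `p.map` for `imap_unordered` with `chunksize=1` and log each completion, so the last line printed points at the stuck site list:

```python
import multiprocessing as mp

def work(x):
    # stand-in for the real merge_sites worker
    return x * x

if __name__ == "__main__":
    items = list(range(8))
    # maxtasksperchild=1 recycles each worker after one task, which can
    # also work around per-process memory growth during long runs
    with mp.Pool(processes=4, maxtasksperchild=1) as pool:
        done = 0
        for res in pool.imap_unordered(work, items, chunksize=1):
            done += 1
            print(f"finished {done}/{len(items)} tasks")
```

If the counter stops short of the total, the site list that never reports back is the one to investigate (or to re-run on its own with `--CPU 1`).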

@robinycfang
Author

I have 30x+ BAM files; interestingly, it got stuck for some samples but not all, so I was wondering if it is a memory issue? I gave them a lot of memory, though...
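If memory is a suspect, one cheap check (a generic sketch, not part of Griffin) is to log the process's peak resident set size from inside the script, or to run the command under `/usr/bin/time -v` and compare "Maximum resident set size" between samples that finish and samples that hang:

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process so far, in MB.
    On Linux ru_maxrss is reported in kilobytes; on macOS it is in bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024

print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

Printing this at the start and end of each site list would show whether the stuck samples are the ones climbing toward the job's memory limit.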
