chess sim result with all NAN #16

Open
chenggang108 opened this issue Nov 13, 2020 · 13 comments

@chenggang108

Hi,

I have an issue similar to #5 and #9. The results from chess sim are all NaN. I checked the conversations in issues #5 and #9, but my situation looks different.

Here is how I ran it:

First of all, I downloaded the example files and ran them successfully, so my setup works.

I generated a BED file with chess pairs:
head mm10_chr1_3mb_win_100kb_step.bed
chr1 1 3000001 chr1 1 3000001 0 . + +
chr1 100001 3100001 chr1 100001 3100001 1 . + +
chr1 200001 3200001 chr1 200001 3200001 2 . + +
chr1 300001 3300001 chr1 300001 3300001 3 . + +
chr1 400001 3400001 chr1 400001 3400001 4 . + +
chr1 500001 3500001 chr1 500001 3500001 5 . + +
chr1 600001 3600001 chr1 600001 3600001 6 . + +
chr1 700001 3700001 chr1 700001 3700001 7 . + +
chr1 800001 3800001 chr1 800001 3800001 8 . + +
chr1 900001 3900001 chr1 900001 3900001 9 . + +
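
For readers unfamiliar with this layout: each row in the listing pairs a window with itself and then advances by a fixed step. Below is a minimal sketch that produces rows of the same shape, assuming 3 Mb windows and a 100 kb step as in the file name. The function name `sliding_window_pairs`, the example chromosome size, and the end-of-chromosome rule are illustrative assumptions, not taken from CHESS.

```python
def sliding_window_pairs(chrom, chrom_size, window, step):
    """Build 10-column rows that pair each sliding window with itself.

    Hypothetical helper for illustration only; it mimics the shape of the
    listing above and is not the implementation of `chess pairs`.
    """
    rows = []
    start = 1      # the listing uses 1-based starts
    pair_id = 0
    # Assumption: stop once a full window no longer fits on the chromosome.
    while start + window <= chrom_size:
        end = start + window
        rows.append((chrom, start, end, chrom, start, end, pair_id, ".", "+", "+"))
        start += step
        pair_id += 1
    return rows


# Example with a made-up chromosome size of 4 Mb.
rows = sliding_window_pairs("chr1", 4_000_000, 3_000_000, 100_000)
for row in rows[:3]:
    print("\t".join(str(field) for field in row))
```

Generating the rows this way makes it easy to check that every window is large relative to the 20 kb bin size before running chess sim.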

Then I ran: chess sim reference.balanced.chr1.cool query.chr1.cool mm10_chr1_3mb_win_100kb_step_test.bed test3_chr1.tsv

The cool files were balanced with cooler, and the resolution is 20 kb.

Here is what the log shows:
2020-11-09 19:38:59,424 INFO Note: detected 72 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2020-11-09 19:38:59,424 INFO Note: NumExpr detected 72 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2020-11-09 19:38:59,424 INFO NumExpr defaulting to 8 threads.
2020-11-09 19:39:02,755 INFO CHESS version: 0.3.4
2020-11-09 19:39:02,755 INFO FAN-C version: 0.9.6
2020-11-09 19:39:02,759 INFO Loading reference contact data
Expected 100% (3209946 of 3209946) |#####| Elapsed Time: 0:06:28 Time: 0:06:28
Expected 100% (5892473 of 5892473) |#####| Elapsed Time: 0:11:48 Time: 0:11:48
2020-11-09 21:00:25,584 INFO Loading region pairs
2020-11-09 21:00:25,783 INFO Launching workers
2020-11-09 21:00:26,240 INFO Submitting pairs for comparison
2020-11-09 21:02:40,942 INFO Could not compute similarity for 1925 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins.

I couldn't figure out the problem. The window size is big enough. I also tried removing 'chr' from the BED file, as suggested in #5, but that did not work.

Could you please help with it?

Thanks

Gang
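
The log message above names "too many unmappable bins" as one possible cause. Here is a rough way to check this for a single window, assuming that bins filtered out during balancing carry NaN weights (as cooler's balancing does for bins it excludes) and that window starts are 1-based as in the BED listing. `unmappable_fraction` is a hypothetical helper, not part of CHESS or FAN-C, and the thread does not state what fraction CHESS tolerates.

```python
import numpy as np


def unmappable_fraction(weights, start_bp, end_bp, bin_size):
    """Fraction of bins in [start_bp, end_bp) whose balancing weight is NaN.

    Assumptions (not from the thread): `weights` holds one balancing weight
    per bin of a single chromosome, and coordinates are 1-based starts.
    """
    first = (start_bp - 1) // bin_size
    last = (end_bp - 1 + bin_size - 1) // bin_size  # ceiling division
    window = np.asarray(weights[first:last], dtype=float)
    return float(np.isnan(window).mean())


# Synthetic weights: 200 bins of 20 kb, 30 of them filtered out.
weights = np.ones(200)
weights[10:40] = np.nan

fraction = unmappable_fraction(weights, 1, 3_000_001, 20_000)
print(f"{fraction:.2f} of the bins in this window are unmappable")
```

If nearly every window reports a high fraction, the balancing weights themselves are worth inspecting before looking at the region pairs.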

@nickmachnik
Collaborator

Hi @chenggang108,
I am not sure what the problem here is, so we will have to sleuth a bit.
First of all, could you please re-install chess to update to the latest release (0.3.5)?
I think it would suffice to run

pip install chess-hic --upgrade
pip install fanc --upgrade

Then, to speed things up a bit, it might make sense to convert the input files to FAN-C format, so that you won't need to wait 1.5 hours every time you run chess sim on these data. You can do that with

fanc from-cooler

Could you then re-run the analysis on the converted data and paste the logs here?
Best,
Nick

@chenggang108
Author

Hi Nick,
Thank you for your swift reply. I will try it and let you know.

Thanks
Gang

@biozzq

biozzq commented Nov 22, 2020

Dear @nickmachnik

I have the same problem after converting the cool data to FAN-C format with the following command:
fanc from-cooler case.cool case.hic

The logs are as follows:

2020-11-22 22:02:43,563 INFO Running 'chess sim -p 6 case.hic treat.hic chr2_1mb_win_100kb_step.bed chr2.result'
2020-11-22 22:03:31,765 INFO CHESS version: 0.3.5
2020-11-22 22:03:31,780 INFO FAN-C version: 0.9.7
2020-11-22 22:03:31,806 INFO Loading reference contact data
2020-11-22 22:10:50,411 INFO Loading query contact data
2020-11-22 22:16:16,401 INFO Loading region pairs
2020-11-22 22:16:16,493 INFO Launching workers
2020-11-22 22:16:18,605 INFO Submitting pairs for comparison
2020-11-22 22:16:19,616 INFO Could not compute similarity for 1812 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins
2020-11-22 22:16:34,566 INFO Finished 'chess sim -p 6 case.hic treat.hic chr2_1mb_win_100kb_step.bed chr2.result'
Closing remaining open files:case.hic...donetreat.hic...done

I hope you can help.

Best wishes,
Zheng Zhuqing

@nickmachnik
Collaborator

Hi @biozzq, I think I will have to try to reproduce this to understand what is going on. Can you suggest a small example dataset in cooler format that I could use for this (not necessarily yours; I understand if you don't want to share that data)?

@chenggang108 , does the error persist for you after the upgrade?

@biozzq

biozzq commented Nov 23, 2020

Dear @nickmachnik
I am happy to share my data with you. You can download the cool file and the genome size file from the following link:

https://drive.google.com/drive/folders/1dm66NJD8LgZ-N8HTNNo4FKkYIWI65zCc?usp=sharing

Hope these files can help you.

Best wishes,
Zheng zhuqing

@biozzq

biozzq commented Nov 23, 2020

Dear @nickmachnik

The commands I used are as follows:

fanc from-cooler 0h_for_tad_pairs_no_YM_40kb.cool 0h_for_tad_pairs_no_YM_40kb.hic
fanc from-cooler 60h_for_tad_pairs_no_YM_40kb.cool 60h_for_tad_pairs_no_YM_40kb.hic
chess pairs --file-input new_mm10_gsize --chromosome chr2 4000000 2000000 chr2_4mb_win_2mb_step.bed
chess sim -p 6 0h_for_tad_pairs_no_YM_40kb.hic 60h_for_tad_pairs_no_YM_40kb.hic chr2_4mb_win_2mb_step.bed chr2.result

Best wishes,
Zheng zhuqing

@chenggang108
Author

Hi @nickmachnik

I did all the updates and then tried three things:

First, I ran chess sim with the cool files; this did not work.

Second, I converted the .cool files to .fanc files with fanc from-cooler; this did not work either.

Finally, I prepared .hic files from my allValidPairs generated by HiC-Pro; these .hic files work, but I need to remove 'chr' from the BED file prepared by chess pairs.

I did not generate .fanc files from allValidPairs only because I am not familiar with FAN-C.

It looks like there is something wrong with the cool files.
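
Removing 'chr' from the BED file, as described above, can be sketched as follows. This assumes the 10-column, tab-separated layout shown earlier in the thread, with chromosome names in columns 1 and 4. `strip_chr_prefix` is a hypothetical helper; whether the prefix has to be removed depends on how chromosomes are named inside the matrix file.

```python
def strip_chr_prefix(line):
    """Remove a leading 'chr' from both chromosome columns of a pair row.

    Hypothetical helper; assumes tab-separated fields with chromosome
    names in columns 1 and 4 (indices 0 and 3).
    """
    fields = line.rstrip("\n").split("\t")
    for col in (0, 3):
        if fields[col].startswith("chr"):
            fields[col] = fields[col][3:]
    return "\t".join(fields)


example = "chr1\t1\t3000001\tchr1\t1\t3000001\t0\t.\t+\t+\n"
converted = strip_chr_prefix(example)
print(converted)
```

The same function applied to every line of the BED file gives a version whose chromosome names have no prefix, while all other columns stay unchanged.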

@nickmachnik
Collaborator

@biozzq I can reproduce the all-NaN output with your data, but I don't know what is wrong yet. I will try to find out.
@chenggang108 do you know what was wrong with the cool files?
@biozzq, following up on chenggang108's comment, are you completely sure that the data / the cool files are OK?

@biozzq

biozzq commented Nov 25, 2020

Dear @nickmachnik

Thank you very much. I will try to use the .hic files generated by HiC-Pro. Also, which normalization should be done before running chess? According to your publication (quoted below), you used KR normalization rather than ICE. However, I use ICE via HiC-Pro most of the time. From the same passage, I also think the normalization should be done after masking the bins as zero. Is this right?

"Finally, bins with less than 25% (human) or 10% (mouse) of the median number of fragments per bin were masked and the matrix was normalized using Knight–Ruiz (KR) matrix balancing on each chromosome independently."

Best wishes,
Zheng zhuqing

@nickmachnik
Collaborator

nickmachnik commented Nov 30, 2020

Hi @biozzq ,
We used KR balancing, but I believe ICE should be fine too. You have to mask bins before balancing, as stated in the paper. Balancing should give you equal sums in all rows, so you won't find poorly mappable bins after.

I am not sure what you mean by 'masking the bins as zero' though, could you elaborate?
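
The order "mask first, then balance" can be illustrated with a small sketch. It uses a simple damped symmetric scaling loop, which is neither KR nor the FAN-C implementation, and it masks by marginal coverage as a stand-in for the fragment-count criterion quoted from the paper. The function name `mask_and_balance` and its parameters are assumptions made for this example.

```python
import numpy as np


def mask_and_balance(matrix, min_fraction=0.1, n_iter=200):
    """Drop low-coverage bins, then scale the rest to equal row sums.

    Illustrative only: a damped iterative scaling, not KR and not FAN-C code.
    """
    matrix = np.asarray(matrix, dtype=float)
    coverage = matrix.sum(axis=1)
    # Step 1: mask bins whose coverage is far below the median.
    keep = coverage >= min_fraction * np.median(coverage)
    sub = matrix[np.ix_(keep, keep)]
    # Step 2: balance only the bins that survived masking.
    bias = np.ones(sub.shape[0])
    for _ in range(n_iter):
        row_sums = (sub / np.outer(bias, bias)).sum(axis=1)
        bias *= np.sqrt(row_sums / row_sums.mean())
    balanced = sub / np.outer(bias, bias)
    return balanced, keep


# Toy symmetric matrix; the third bin has almost no contacts.
raw = np.array([
    [50, 20, 0, 8, 4],
    [20, 60, 1, 18, 9],
    [0, 1, 0, 1, 0],
    [8, 18, 1, 40, 22],
    [4, 9, 0, 22, 70],
], dtype=float)

balanced, keep = mask_and_balance(raw)
print("kept bins:", keep)
print("row sums after balancing:", balanced.sum(axis=1))
```

If the nearly empty bin were left in, balancing would assign it a very large correction factor and amplify its few contacts, which is one reason to mask such bins before balancing.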

@nickmachnik
Collaborator

Hi, @kaukrise found a potential fix for this issue, see #23 (comment)

@biozzq

biozzq commented Dec 1, 2020

Dear @nickmachnik

Sorry, I was not clear. Masking bins can be done in different ways: for example, by treating the interaction frequencies of these bins as zero, or by removing these bins from the contact maps. I therefore want to confirm which way you used in your study. Thank you.

Best wishes,
Zheng zhuqing

@nickmachnik
Collaborator

nickmachnik commented Dec 2, 2020

FAN-C uses numpy masked arrays for masking bins. Masked bins are simply ignored in downstream analyses; this should be different from setting them to 0. You can read more about the FAN-C pipeline here.
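
The difference between masking and zeroing can be seen in a small numpy example. It illustrates numpy masked arrays in general; the values are made up, and it does not show how FAN-C stores its masks internally.

```python
import numpy as np

# Made-up contact values for four bins; the third bin is unmappable.
values = np.array([4.0, 6.0, 0.0, 8.0])
unmappable = np.array([False, False, True, False])

# Masking: the unmappable bin is excluded from the computation.
masked = np.ma.masked_array(values, mask=unmappable)

# Zeroing: the unmappable bin still counts as an observation of 0.
zeroed = values.copy()
zeroed[unmappable] = 0.0

print(masked.mean())  # 6.0, averaged over the three valid bins
print(zeroed.mean())  # 4.5, averaged over all four bins
```

With masking, a statistic reflects only the mappable bins, whereas zeros pull the value down as if no contacts had been observed there.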
