Automatically catch intervals which extend beyond the chr? #54

Open
cafletezbrant opened this issue Oct 10, 2024 · 2 comments

@cafletezbrant

Hi gReLU team,

Here is a reprex of a pair of SNPs that I am trying to predict using Borzoi. The prediction interval of the first SNP is contained within the chromosome, but the interval of the second is not.

import grelu
import grelu.data.dataset  # explicit import of the submodule used below
import grelu.resources
import pandas as pd

df = pd.DataFrame({
    'ref': ['G', 'C'],
    'alt': ['A', 'T'],
    'chrom': ['chr17']*2,
    'start': [80920469, 81005659],
    'end': [80920470, 81005660],
    'pos': [80920469, 81005659]
})
grelu.data.dataset.VariantDataset(variants=df, seq_len=524288, genome='hg19')

This gives the error

AssertionError: All input sequences must have the same length.

It would be great for this to be caught automatically somehow :)
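A minimal sketch of the kind of check being requested, assuming the prediction window is centered on `pos` and that chromosome sizes are supplied by the caller. The helper names `window_bounds` and `out_of_bounds` are hypothetical and not part of gReLU; the only chromosome size included is the hg19 length of chr17.

```python
# Sketch: flag variants whose centered window would run past the chromosome.
# Assumptions: the window is centered on `pos`, and chromosome sizes are
# passed in by the caller (only hg19 chr17 is hard-coded here).
CHROM_SIZES = {"chr17": 81_195_210}  # hg19 length of chr17


def window_bounds(pos, seq_len):
    """Return (start, end) of a seq_len-wide window centered on pos."""
    start = pos - seq_len // 2
    return start, start + seq_len


def out_of_bounds(chrom, pos, seq_len, chrom_sizes=CHROM_SIZES):
    """True if the window starts before 0 or ends past the chromosome end."""
    start, end = window_bounds(pos, seq_len)
    return start < 0 or end > chrom_sizes[chrom]


variants = [("chr17", 80920469), ("chr17", 81005659)]
flags = [out_of_bounds(chrom, pos, 524288) for chrom, pos in variants]
print(flags)  # [False, True]
```

With these numbers the second window would end at 81267803, which is past the end of chr17 in hg19, so only the second variant is flagged.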

@avantikalal avantikalal added the enhancement New feature or request label Oct 14, 2024
@avantikalal avantikalal self-assigned this Oct 15, 2024
@HelloWorldLTY

Hi, I have a similar question here. I am using Borzoi with grelu.data.dataset.AnnDataSeqDataset. I have checked that every gene in my dataset has a sequence length of 524288, but I still receive the assertion error:

File /home/tl688/.conda/envs/evo/lib/python3.11/site-packages/grelu/sequence/format.py:414, in convert_input_type(inputs, output_type, genome, add_batch_axis)
    412         return strings_to_one_hot(inputs, add_batch_axis=add_batch_axis)
    413     elif output_type == "indices":
--> 414         return strings_to_indices(inputs, add_batch_axis=add_batch_axis)
    416 # Convert indices
    417 if input_type == "indices":

File /home/tl688/.conda/envs/evo/lib/python3.11/site-packages/grelu/sequence/format.py:251, in strings_to_indices(strings, add_batch_axis)
    247         return arr
    249 # Convert multiple sequences; they must all have equal length
    250 else:
--> 251     assert check_equal_lengths(
    252         strings
    253     ), "All input sequences must have the same length."
    254     return np.stack(
    255         [[BASE_TO_INDEX_HASH[base] for base in string] for string in strings]
    256     ).astype(np.int8)

AssertionError: All input sequences must have the same length.
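The assertion does not say which input has the wrong length. One way to narrow it down, sketched here with a hypothetical helper `length_report` that is not part of gReLU, is to tally string lengths before they reach `strings_to_indices`. How the sequence strings are extracted from the dataset is not shown and is left as an assumption.

```python
from collections import Counter


def length_report(seqs):
    """Return (most common length, indices of sequences with another length)."""
    counts = Counter(len(s) for s in seqs)
    expected = counts.most_common(1)[0][0]
    odd = [i for i, s in enumerate(seqs) if len(s) != expected]
    return expected, odd


# Toy input standing in for the real sequences.
seqs = ["ACGT", "ACGT", "ACG", "ACGT"]
expected, odd = length_report(seqs)
print(expected, odd)  # 4 [2]
```

Any index in the second return value points at a string whose length differs from the majority, which is what would trigger the assertion.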

@cafletezbrant
Author

Update: the error I was reporting is actually independent of sequence length / extending beyond chromosome boundary.

I walked the call stack for grelu.data.dataset.VariantDataset(), and the source of my error is VariantDataset._load_alleles(), which internally calls grelu.sequence.format.strings_to_indices(). That function expects all of the alleles it is given to have the same length (presumably each variant is expected to be a SNP rather than an indel).

Here is an updated reprex:

from grelu.sequence.format import strings_to_indices
alleles = ['A', 'C', 'AC', 'G']
strings_to_indices(alleles)
## AssertionError: All input sequences must have the same length.
## VERSUS
alleles = ['A', 'C', 'G']
strings_to_indices(alleles)
## array([[0],
##       [1],
##       [2]], dtype=int8)

So the simple fix for me as a user is just to focus on SNPs, at least for now.
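A minimal sketch of that workaround, assuming a SNP is any variant whose ref and alt alleles are both a single base. The helper `is_snp` is hypothetical and not part of gReLU; for a DataFrame like the one in the first reprex, the pairs could come from `zip(df['ref'], df['alt'])`.

```python
def is_snp(ref, alt):
    """True when both alleles are exactly one base long."""
    return len(ref) == 1 and len(alt) == 1


# (ref, alt) pairs: two SNPs, one insertion, one deletion.
variants = [("G", "A"), ("C", "T"), ("A", "AC"), ("GT", "G")]

snps = [v for v in variants if is_snp(*v)]
others = [v for v in variants if not is_snp(*v)]

print(snps)    # [('G', 'A'), ('C', 'T')]
print(others)  # [('A', 'AC'), ('GT', 'G')]
```

Keeping the non-SNP variants in a separate list makes it easy to see what was dropped before building the dataset.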

3 participants