Add H5 GenomicFeatures support for more flexible target datatypes #200

Open: wants to merge 64 commits into base: master

Commits (64)
3946957
selene changes for methylation model
kathyxchen Dec 6, 2022
81e8e45
remove f1 metric
kathyxchen Dec 15, 2022
1b14ecc
bugfix and some temporary profiling
kathyxchen Dec 16, 2022
3be4618
adjustments to loss breakdown and logging
kathyxchen Dec 19, 2022
20eea13
adjust return type for get_features_data
kathyxchen Dec 19, 2022
d34bf84
bugfix for get_feature_data
kathyxchen Dec 19, 2022
3b6f4ad
remove event object init
kathyxchen Dec 20, 2022
c5bad67
initial changes to h5 dataloader
kathyxchen Jan 11, 2023
5d5438a
methylation sampler excl
kathyxchen Jan 11, 2023
831b57a
add type specification to unpackbits
kathyxchen Jan 13, 2023
bc34a88
change loss
kathyxchen Mar 3, 2023
31f55ac
experimenting w loss
kathyxchen Mar 20, 2023
b40b8cd
add pearsonr
kathyxchen Mar 21, 2023
8b5c884
fix metrix nan removal bug
kathyxchen Mar 22, 2023
54ddea1
changes to sampler for positives-only hdf5
kathyxchen Apr 3, 2023
3bcd098
loss weighting
kathyxchen Apr 3, 2023
9e27969
adjust methylation perf metric output checking
kathyxchen Apr 3, 2023
462b729
revert multisampler and add spearmanr to training log
kathyxchen Apr 6, 2023
d0ae091
explicit metric fn checking, refine later
kathyxchen Apr 14, 2023
1fd2d96
minor adjustment to dataloader
kathyxchen Jul 7, 2023
6f75a45
non strand specific temp changes
kathyxchen Jul 10, 2023
8ac4db6
positives only sampler
kathyxchen Jul 14, 2023
58a868b
revamp dataloader, seq length flexibility
kathyxchen Aug 8, 2023
249f302
attempted a nonstrandspecific utils module, not being used
kathyxchen Aug 8, 2023
cb55bbc
evaluation classes
kathyxchen Aug 16, 2023
c761f36
revert non strand specific module
kathyxchen Sep 13, 2023
1ca2b37
trying to figure out unet dataloader changes
kathyxchen Sep 14, 2023
5591680
trying to figure out unet dataloader changes - add tgt shift
kathyxchen Sep 14, 2023
3fa4bb0
remove print debug statements
kathyxchen Sep 18, 2023
79af769
addr memory issue in eval
kathyxchen Sep 19, 2023
e0d0383
comment
kathyxchen Sep 20, 2023
4ce9c99
troubleshooting dataloader shift
kathyxchen Oct 18, 2023
3229569
add strand arg to _retrieve
kathyxchen Oct 20, 2023
d09da28
shift testing
kathyxchen Oct 20, 2023
4f82951
adjust casting of targets in dataloader
kathyxchen Nov 7, 2023
cd390dd
minor changes to eval and sampling
kathyxchen Jan 27, 2024
240f406
remove unused code in nonstrandspecific module
kathyxchen Mar 7, 2024
b7f9968
remove indicators from dataloader, clean up methylation performance m…
kathyxchen Mar 15, 2024
eebe14b
remove unused files from version control, adjust multi sampler get da…
kathyxchen May 1, 2024
152ad61
remove commented out code in shift sections
kathyxchen May 1, 2024
6fc0f7b
remove ind commented out code in multisampler
kathyxchen May 1, 2024
fa69345
add excl chr optional arg
kathyxchen May 1, 2024
3adbf23
make adjustments to nonstrandspecific, performancemetrics, for what c…
kathyxchen May 6, 2024
dca3ae4
remove files that we have merged the functionality into existing classes
kathyxchen May 7, 2024
53f498d
clean up indicator code that is no longer used
kathyxchen May 7, 2024
437092e
integrate changes for methylation prediction in training and metrics
kathyxchen May 28, 2024
6aabd4e
incorporate changes from previous PR on config yaml and model file sa…
kathyxchen May 29, 2024
947825d
update pytorch version constraint
kathyxchen May 29, 2024
eb6edba
variable name fix for copying model file / directory
kathyxchen Jul 8, 2024
e9e54db
merge changes from 0.5.3
kathyxchen Jul 15, 2024
59f70f9
remove train methylation model class
kathyxchen Jul 15, 2024
95a234a
fix bug in random positions sampler with exclude_chrs, overload targe…
kathyxchen Dec 6, 2024
55d30b8
remove unused method in genomicfeaturesh5
kathyxchen Dec 6, 2024
597f570
adjust strand vs feature (target) column ordering assumptions in tabi…
kathyxchen Dec 6, 2024
34fa342
adjust descriptions for target classes
kathyxchen Dec 6, 2024
8d84a17
minor adjustments to docstrings
kathyxchen Dec 6, 2024
fb2cd99
tuple output handling for non strand specific
kathyxchen Dec 6, 2024
e960cee
line breaks for formatting in performance metrics file
kathyxchen Dec 6, 2024
81d676b
Merge remote-tracking branch 'upstream/master' into methylation-sampler
kathyxchen Dec 6, 2024
3fe05aa
adjust CLI config parsing so that a copy of the input config file is …
kathyxchen Dec 11, 2024
743c2db
update versioning
kathyxchen Dec 11, 2024
7269d79
add new dependency
kathyxchen Dec 11, 2024
7b09b6a
adjustment in
kathyxchen Dec 11, 2024
7447790
update release notes for 0.6.0
kathyxchen Dec 11, 2024
20 changes: 20 additions & 0 deletions RELEASE_NOTES.md
@@ -2,6 +2,26 @@

This is a document describing new functionality, bug fixes, breaking changes, etc. associated with Selene version releases from v0.5.0 onwards.

## Version 0.6.0
- `config_utils.py`: Additional information is now saved upon running Selene. Specifically, we record the version of Selene used for the run and make a copy of the input configuration file, saving both alongside the model architecture file in the output directory. This adds a new dependency to Selene, the package `ruamel.yaml`.
- `H5Dataloader` and `_H5Dataset`: Previously, `H5Dataloader` took a number of arguments that it used to initialize `_H5Dataset` internally. One major change in this version is that users now initialize `_H5Dataset` explicitly and pass it to `H5Dataloader` as a class argument. This makes the two classes consistent with the PyTorch specifications for `Dataset` and `DataLoader`, enabling compatibility with the different data-parallelization configurations supported by PyTorch and the PyTorch Lightning framework (see the sketch after this list).
- `_H5Dataset` class initialization optional arguments:
- `unpackbits` can now be specified separately for sequences and targets by way of `unpackbits_seq` and `unpackbits_tgt`
- `use_seq_len` subsets each retrieved sequence to its center `use_seq_len` bases.
- `shift` (particularly when paired with `use_seq_len`) retrieves sequences offset from the center position by `shift` bases. Note that `shift` currently only shifts in a single direction, determined by the sign of the integer passed.
- `GenomicFeaturesH5`: This is a new targets class to handle continuous-valued targets, stored in an HDF5 file, that can be retrieved by genomic coordinate. As with `GenomicFeatures`, genomic regions are stored in a tabix-indexed BED file; the main change is that the BED file now specifies, for each genomic region, the index of the row in the HDF5 matrix that contains the target values to predict. If multiple target rows are returned for a query region, the average of those rows is returned.
- `RandomPositionsSampler`:
- `exclude_chrs`: Added a new optional argument; the default `exclude_chrs=['_']` excludes all nonstandard chromosomes by ignoring any chromosome with an underscore in its name. Pass in a list of chromosome names or substrings to exclude. When loading possible sampling positions, the class iterates through `exclude_chrs` and, for each substring `s` in the list, checks whether `s in chrom`; if so, that chromosome is skipped entirely (see the small illustration after this list).
- Internal function `_retrieve` now takes an optional argument `strand` (default `None`) to enable explicit retrieval of the sequence at `chrom, position` on a specific strand. The default behavior of the `RandomPositionsSampler` class remains the same: the strand is randomly selected for each genomic position sampled.
- `PerformanceMetrics`:
- Now supports `spearmanr` and `pearsonr` from `scipy.stats`. There is room to generalize this class further in the future.
- The `update` function now accepts an optional argument `scores`, a `list(str)` naming the subset of metrics to compute.
- `TrainModel`:
- `self.step` now starts from `self._start_step`, which is nonzero when resuming from a Selene-saved checkpoint.
- Removed the call to `self._test_metrics.visualize` in `evaluate`, since the visualize method does not generalize well.
- `NonStrandSpecific`: Can now handle a model whose `forward` returns two outputs; the wrapper takes the mean or max of the forward- and reverse-strand predictions for each of the two outputs separately. A custom `NonStrandSpecific` class is recommended for more specific cases.
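
A minimal sketch of the new initialization pattern described above, assuming hypothetical file paths, key names, and lengths:

```python
from selene_sdk.samplers.dataloader import _H5Dataset, H5DataLoader

# Hypothetical HDF5 file and key names; adjust to your dataset layout.
dataset = _H5Dataset(
    "train_data.h5",
    in_memory=False,
    unpackbits_seq=True,    # sequences were stored with np.packbits
    unpackbits_tgt=False,   # targets are stored unpacked
    sequence_key="sequences",
    targets_key="targets",
    use_seq_len=1000,       # keep the center 1000 bases of each sequence
    shift=50,               # offset the retrieved window 50 bases from center
)
# The dataset is now constructed explicitly and passed in, matching the
# PyTorch Dataset / DataLoader split.
loader = H5DataLoader(dataset, num_workers=4, batch_size=64, shuffle=True)
```

And a small standalone illustration of the `exclude_chrs` substring check described for `RandomPositionsSampler` (a re-implementation of the documented logic, not the sampler's actual code path):

```python
def excluded(chrom, exclude_chrs=("_",)):
    # Mirrors the documented check: skip `chrom` if any substring matches.
    return any(s in chrom for s in exclude_chrs)

assert excluded("chr1_KI270706v1_random")            # nonstandard, contains "_"
assert not excluded("chr8")
assert excluded("chrM", exclude_chrs=("_", "chrM"))  # explicit exclusion
```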


## Version 0.5.3
- Adjust dependency requirements

3 changes: 2 additions & 1 deletion selene-cpu.yml
@@ -11,8 +11,9 @@ dependencies:
- matplotlib=2.0.2
- numpy=1.21.4
- pandas=0.20.3
- python=3.6.2
- python=3.9
- pyyaml=5.1
- ruamel.yaml=0.18.6
- scikit-learn=0.19.0
- scipy=1.1.0
- seaborn=0.8.1
11 changes: 6 additions & 5 deletions selene-gpu.yml
@@ -4,7 +4,7 @@ channels:
- bioconda
- conda-forge
dependencies:
- cython=0.29.3
- cython=0.29.24
- click==7.1.2
- docopt=0.6.2
- h5py=2.9.0
@@ -15,11 +15,12 @@ dependencies:
- statsmodels=0.9.0
- pytabix=0.0.2
- matplotlib=2.2.2
- python=3.6.5
- numpy=1.15.1
- python=3.9
- ruamel.yaml=0.18.6
- numpy=1.21.4
- plotly=2.7.0
- cudatoolkit=10.0.130=0
- pytorch=1.0.1=py3.6_cuda10.0.130_cudnn7.4.2_2
- torchvision=0.2.2=py_3
- pytorch=2.4.1=py3.9_cuda11.8_cudnn9.1.0_0
- torchvision=0.20.0=py39_cu118
- pyfaidx=0.5.5.2
- seaborn=0.8.1
5 changes: 2 additions & 3 deletions selene_sdk/cli.py
@@ -14,7 +14,7 @@
import click

from selene_sdk import __version__
from selene_sdk.utils import load_path, parse_configs_and_run
from selene_sdk.utils import load_and_parse_configs_and_run


@click.command()
@@ -23,8 +23,7 @@
@click.option('--lr', type=float, help='If training, the optimizer learning rate', show_default=True)
def main(path, lr):
"""Build the model and trains it using user-specified input data."""
configs = load_path(path, instantiate=False)
parse_configs_and_run(configs, lr=lr)
load_and_parse_configs_and_run(path, lr=lr)


if __name__ == "__main__":
132 changes: 97 additions & 35 deletions selene_sdk/samplers/dataloader.py
@@ -3,6 +3,7 @@
which allow parallel sampling for any Sampler using
torch DataLoader mechanism.
"""
import random
import sys

import h5py
@@ -125,6 +126,27 @@ def worker_init_fn(worker_id):
self.seed = seed


def unpackbits_sequence(sequence, s_len):
sequence = np.unpackbits(sequence.astype(np.uint8), axis=-2)
nulls = np.sum(sequence, axis=-1) == sequence.shape[-1]
sequence = sequence.astype(float)
sequence[nulls, :] = 1.0 / sequence.shape[-1]
if sequence.ndim == 3:
sequence = sequence[:, :s_len, :]
else:
sequence = sequence[:s_len, :]
return sequence
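# Example of what unpacking recovers (hypothetical shapes): a one-hot sequence
# of shape (s_len, 4) stored with np.packbits along the length axis unpacks to
# a padded array whose length is a multiple of 8; all-ones rows mark unknown
# bases and are reset to a uniform 1.0 / n_channels (0.25 for DNA), and the
# padding is then trimmed back to s_len.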


def unpackbits_targets(targets, t_len):
targets = np.unpackbits(targets, axis=-1).astype(float)
if targets.ndim == 2:
targets = targets[:, :t_len]
else:
targets = targets[:t_len]
return targets


class _H5Dataset(Dataset):
"""
This class provides a Dataset that directly loads sequences and targets
@@ -160,13 +182,24 @@ class _H5Dataset(Dataset):
def __init__(self,
file_path,
in_memory=False,
unpackbits=False,
unpackbits=False, # implies unpackbits for both
unpackbits_seq=False,
unpackbits_tgt=False,
sequence_key="sequences",
targets_key="targets"):
targets_key="targets",
use_seq_len=None,
shift=None):
super(_H5Dataset, self).__init__()
self.file_path = file_path
self.in_memory = in_memory

self.unpackbits = unpackbits
self.unpackbits_seq = unpackbits_seq
self.unpackbits_tgt = unpackbits_tgt

self.use_seq_len = use_seq_len
self.shift = shift
self._seq_start, self._seq_end = None, None

self._initialized = False
self._sequence_key = sequence_key
@@ -178,15 +211,22 @@ def init(func):
def dfunc(self, *args, **kwargs):
if not self._initialized:
self.db = h5py.File(self.file_path, 'r')

if self.unpackbits:
self.s_len = self.db['{0}_length'.format(self._sequence_key)][()]
self.t_len = self.db['{0}_length'.format(self._targets_key)][()]
elif self.unpackbits_seq:
self.s_len = self.db['{0}_length'.format(self._sequence_key)][()]
elif self.unpackbits_tgt:
self.t_len = self.db['{0}_length'.format(self._targets_key)][()]

if self.in_memory:
self.sequences = np.asarray(self.db[self._sequence_key])
self.targets = np.asarray(self.db[self._targets_key])
else:
self.sequences = self.db[self._sequence_key]
self.targets = self.db[self._targets_key]

self._initialized = True
return func(self, *args, **kwargs)
return dfunc
@@ -195,25 +235,33 @@ def dfunc(self, *args, **kwargs):
def __getitem__(self, index):
if isinstance(index, int):
index = index % self.sequences.shape[0]
sequence = self.sequences[index, :, :]
targets = self.targets[index, :]
sequence = self.sequences[index]
targets = self.targets[index]

if self.unpackbits:
sequence = np.unpackbits(sequence, axis=-2)
nulls = np.sum(sequence, axis=-1) == sequence.shape[-1]
sequence = sequence.astype(float)
sequence[nulls, :] = 1.0 / sequence.shape[-1]
targets = np.unpackbits(
targets, axis=-1).astype(float)
if sequence.ndim == 3:
sequence = sequence[:, :self.s_len, :]
else:
sequence = sequence[:self.s_len, :]
if targets.ndim == 2:
targets = targets[:, :self.t_len]
else:
targets = targets[:self.t_len]
return (torch.from_numpy(sequence.astype(np.float32)),
torch.from_numpy(targets.astype(np.float32)))
sequence = unpackbits_sequence(sequence, self.s_len)
targets = unpackbits_targets(targets, self.t_len)
elif self.unpackbits_seq:
sequence = unpackbits_sequence(sequence, self.s_len)
elif self.unpackbits_tgt:
targets = unpackbits_targets(targets, self.t_len)

if self._seq_start is None:
self._seq_start = 0
self._seq_end = len(sequence)

if self.use_seq_len is not None:
mid = len(sequence) // 2
self._seq_start = int(mid - np.ceil(self.use_seq_len / 2))
self._seq_end = mid + self.use_seq_len // 2
if self.shift is not None:
self._seq_start += self.shift
self._seq_end += self.shift
sequence = sequence[self._seq_start:self._seq_end]

s = sequence.astype(np.float32)
return (torch.from_numpy(s), torch.from_numpy(targets))
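# Worked example (hypothetical numbers): with len(sequence) == 1000,
# use_seq_len == 600, and shift == 50: mid = 500,
# _seq_start = 500 - 300 + 50 = 250, _seq_end = 500 + 300 + 50 = 850,
# so the returned window is sequence[250:850], i.e. 600 bases shifted
# 50 positions downstream of center.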


@init
def __len__(self):
@@ -288,20 +336,38 @@ class H5DataLoader(DataLoader):

"""
def __init__(self,
filepath,
in_memory=False,
dataset,
num_workers=1,
use_subset=None,
batch_size=1,
shuffle=True,
unpackbits=False,
sequence_key="sequences",
targets_key="targets"):
seed=436,
sampler=None,
batch_sampler=None,
shuffle=True):
g = torch.Generator()
g.manual_seed(seed)

def worker_init_fn(worker_id):
worker_seed = torch.initial_seed() % 2**32
print("Worker seed", worker_seed)
np.random.seed(worker_seed)
random.seed(worker_seed)
torch.manual_seed(worker_seed)

args = {
"batch_size": batch_size,
"num_workers": 0 if in_memory else num_workers,
"pin_memory": True
"pin_memory": True,
"worker_init_fn": worker_init_fn,
"sampler": sampler,
"batch_sampler": batch_sampler,
"generator": g,
}

if hasattr(dataset, 'in_memory'):
args['num_workers'] = 0 if dataset.in_memory else num_workers
else:
args['num_workers'] = num_workers

if use_subset is not None:
from torch.utils.data.sampler import SubsetRandomSampler
if isinstance(use_subset, int):
@@ -311,10 +377,6 @@ def __init__(self,
args["sampler"] = SubsetRandomSampler(use_subset)
else:
args["shuffle"] = shuffle
super(H5DataLoader, self).__init__(
_H5Dataset(filepath,
in_memory=in_memory,
unpackbits=unpackbits,
sequence_key=sequence_key,
targets_key=targets_key),
**args)

super(H5DataLoader, self).__init__(dataset, **args)

9 changes: 4 additions & 5 deletions selene_sdk/samplers/file_samplers/mat_file_sampler.py
@@ -240,13 +240,12 @@ def get_data_and_targets(self, batch_size, n_samples=None):
sequences_and_targets = []
targets_mat = []

count = 0
while count < n_samples:
sample_size = min(n_samples - count, batch_size)
seqs, tgts = self.sample(batch_size=sample_size)
for ix in range(0, n_samples, batch_size):
s = ix
e = min(ix+batch_size, n_samples)
seqs, tgts = self.sample(batch_size=e-s)
sequences_and_targets.append((seqs, tgts))
targets_mat.append(tgts)
count += sample_size

# TODO: should not assume targets are always integers
targets_mat = np.vstack(targets_mat).astype(float)
16 changes: 6 additions & 10 deletions selene_sdk/samplers/multi_sampler.py
@@ -221,6 +221,7 @@ def sample(self, batch_size=1, mode=None):
except StopIteration:
#If DataLoader iterator reaches its length, reinitialize
self._iterators[mode] = iter(self._dataloaders[mode])

data, targets = next(self._iterators[mode])
return data.numpy(), targets.numpy()

@@ -260,16 +261,11 @@ def get_data_and_targets(self, batch_size, n_samples=None, mode=None):
self._set_batch_size(batch_size, mode=mode)
data_and_targets = []
targets_mat = []
count = batch_size
while count < n_samples:
data, tgts = self.sample(batch_size=batch_size, mode=mode)
data_and_targets.append((data, tgts))
targets_mat.append(tgts)
count += batch_size
remainder = batch_size - (count - n_samples)
data, tgts = self.sample(batch_size=remainder)
data_and_targets.append((data, tgts))
targets_mat.append(tgts)
for s in range(0, n_samples, batch_size):
e = min(n_samples, s+batch_size)
output = self.sample(batch_size=e-s, mode=mode)
data_and_targets.append(output)
targets_mat.append(output[1])
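# Hypothetical example: n_samples=1000 with batch_size=300 yields sample
# sizes 300, 300, 300, 100, covering n_samples exactly (the final partial
# batch is included rather than dropped).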
targets_mat = np.vstack(targets_mat)
return data_and_targets, targets_mat

22 changes: 15 additions & 7 deletions selene_sdk/samplers/online_sampler.py
@@ -25,9 +25,13 @@ class OnlineSampler(Sampler, metaclass=ABCMeta):
----------
reference_sequence : selene_sdk.sequences.Sequence
A reference sequence from which to create examples.
target_path : str
target_path : str or selene_sdk.targets.Target
Path to tabix-indexed, compressed BED file (`*.bed.gz`) of genomic
coordinates mapped to the genomic features we want to predict.
`target_path` will be loaded as a `GenomicFeatures` object.
Currently, `target_path` is also overloaded to accept a
`selene_sdk.targets.Target` object directly, either `GenomicFeatures`
or `GenomicFeaturesH5`.
features : list(str)
List of distinct features that we aim to predict.
seed : int, optional
@@ -213,9 +217,12 @@ def __init__(self,

self.reference_sequence = reference_sequence
self.n_features = len(self._features)
self.target = GenomicFeatures(
target_path, self._features,
feature_thresholds=feature_thresholds)
if isinstance(target_path, str):
self.target = GenomicFeatures(
target_path, self._features,
feature_thresholds=feature_thresholds)
else:
self.target = target_path
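# Hypothetical usage: a constructed Target object, e.g. GenomicFeaturesH5(...),
# may now be passed as `target_path` in place of a BED file path.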
self._save_filehandles = {}

def get_feature_from_index(self, index):
@@ -336,9 +343,10 @@ def get_data_and_targets(self, batch_size, n_samples=None, mode=None):
elif n_samples is None and mode == "test":
n_samples = 640000

n_batches = int(n_samples / batch_size)
for _ in range(n_batches):
inputs, targets = self.sample(batch_size)
for ix in range(0, n_samples, batch_size):
s = ix
e = min(ix+batch_size, n_samples)
inputs, targets = self.sample(e-s)
sequences_and_targets.append((inputs, targets))
targets_mat = np.vstack([t for (s, t) in sequences_and_targets])
if mode in self._save_datasets: