Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeneSplicer: improve argument processing #689

Merged
126 changes: 82 additions & 44 deletions GeneSplicer.pm
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,12 @@ limitations under the License.
=head1 SYNOPSIS

mv GeneSplicer.pm ~/.vep/Plugins
./vep -i variants.vcf --plugin GeneSplicer,[path_to_genesplicer_bin],[path_to_training_dir],[option1=value],[option2=value]
./vep -i variants.vcf --plugin GeneSplicer,binary=$GS/bin/linux/genesplicer,training=$GS/human
./vep -i variants.vcf --plugin GeneSplicer,binary=$GS/bin/linux/genesplicer,training=$GS/human,context=200,tmpdir=/mytmp

# VEP Docker/Singularity: if 'genesplicer' is a command available in $PATH,
# there is no need to specify the location of the binary
./vep -i variants.vcf --plugin GeneSplicer,training=$GS/human

=head1 DESCRIPTION

Expand All @@ -41,49 +46,58 @@ limitations under the License.
sequence included either side defaults to 100bp, but can be modified
by passing e.g. "context=50" as a parameter to the plugin.

Any predicted splicing regions that overlap the variant are reported
in the output with one of four states: no_change, diff, gain, loss
You will need to download the GeneSplicer binary and data from
ftp://ftp.ccb.jhu.edu/pub/software/genesplicer/GeneSplicer.tar.gz. Extract the
folder using:

tar -xzf GeneSplicer.tar.gz

There follows a "/"-separated string consisting of the following data:
GeneSplicer comes with precompiled binaries for multiple systems. If the
provided binaries do not run, compile `genesplicer` from source:

1) type (donor, acceptor)
2) coordinates (start-end)
3) confidence (Low, Medium, High)
4) score
```
cd $GS/sources
# if macOS, run this step
[[ $(uname -s) == "Darwin" ]] && perl -pi -e "s/^main /int main /" genesplicer.cpp
make
cd -
./vep [options] --plugin GeneSplicer,$GS/sources/genesplicer,$GS/human
```

Example: loss/acceptor/727006-727007/High/16.231924
Predicted splicing regions that overlap the variant are reported in the output
with a '/'-separated string (e.g., 'loss/acceptor/727006-727007/High/16.231924')
consisting of the following data by this order:

1) state (no_change, diff, gain, loss)
2) type (donor, acceptor)
3) coordinates (start-end)
4) confidence (Low, Medium, High)
5) score

If multiple sites are predicted, their reports are separated by ",".

For diff, the confidence and score for both the reference and alternate
sequences is reported as REF-ALT.

Example: diff/donor/621915-621914/Medium-Medium/7.020731-6.988368
sequences is reported as REF-ALT, such as
'diff/donor/621915-621914/Medium-Medium/7.020731-6.988368'.

Several key=value parameters can be modified in the the plugin string:

training : (mandatory) directory to species-specific training data, such as
`GeneSplicer/human`
binary : path to `genesplicer` binary (default: `genesplicer`)
context : change the amount of sequence added either side of
the variant (default: 100bp)
tmpdir : change the temporary directory used (default: /tmp)
tmpdir : change the temporary directory used (default: `/tmp`)
cache_size : change how many sequences' scores are cached in memory
(default: 50)

Example:
--plugin GeneSplicer,$GS/bin/linux/genesplicer,$GS/human,context=200,tmpdir=/mytmp

On some systems the binaries provided will not execute, but can be compiled from source:

cd $GS/sources
make
cd -
./vep [options] --plugin GeneSplicer,$GS/sources/genesplicer,$GS/human

On Mac OSX the make step is known to fail; the genesplicer.cpp file requires modification:
--plugin GeneSplicer,binary=$GS/bin/linux/genesplicer,training=$GS/human,context=200,tmpdir=/mytmp

cd $GS/sources
perl -pi -e "s/^main /int main /" genesplicer.cpp
make

When using VEP Docker/Singularity, the `binary` argument can be ommitted, as
the `genesplicer` command is exported in the $PATH variable and is thus
automatically detected by the plugin:
--plugin GeneSplicer,training=$GS/human,context=200,tmpdir=/mytmp

=cut

Expand All @@ -106,6 +120,23 @@ our %DEFAULTS = (
cache_size => 50,
);

sub _check_binary {
my $bin = shift;
olaaustine marked this conversation as resolved.
Show resolved Hide resolved
# if $bin is the genesplicer command, try to get its path to check if it exists
$bin = (`which $bin 2>&1` || '') unless -e bin;
nuno-agostinho marked this conversation as resolved.
Show resolved Hide resolved
die("ERROR: genesplicer binary not found\n") unless -e $bin;

my $test = `$bin 2>&1` || '';
die("ERROR: failed to run genesplicer binary:\n$test\n") unless $test =~ /^USAGE/;
= $bin;
nuno-agostinho marked this conversation as resolved.
Show resolved Hide resolved
}

sub _check_training_dir {
my $training_dir = shift;
die("ERROR: training directory not specified\n") unless $training_dir;
die("ERROR: training directory not found\n") unless -d $training_dir;
}

sub new {
my $class = shift;

Expand All @@ -116,32 +147,39 @@ sub new {

my $params = $self->params;

my $bin = shift @$params;
die("ERROR: genesplicer binary not specified\n") unless $bin;
die("ERROR: genesplicer binary not found\n") unless -e $bin;
my $test = `$bin 2>&1`;
die("ERROR: failed to run genesplicer binary:\n$test\n") unless $test =~ /^USAGE/;
$self->{_bin} = $bin;

my $training_dir = shift @$params;
die("ERROR: training directory not specified\n") unless $training_dir;
die("ERROR: training directory not found\n") unless -d $training_dir;
$self->{_training_dir} = $training_dir;

# defaults
$self->{'_param_'.$_} = $DEFAULTS{$_} for keys %DEFAULTS;

# REST API passes 1 as first param
shift @$params if $params->[0] && $params->[0] eq '1';

# set/override with user params
foreach my $param(@$params) {
next if $param eq '1'; # REST API passes 1 as a param

my ($key, $val) = split('=', $param);
die("ERROR: Failed to parse parameter $param\n") unless defined($key) && defined($val);

$self->{'_param_'.$key} = $val;
if (!defined $val) {
# for positional arguments, set 'binary' and 'training data'
if (!defined $self->{_bin}) {
$self->{_bin} = $param;
next;
} elsif (!defined $self->{_training_dir}) {
$self->{_training_dir} = $param;
next;
}
die("ERROR: Failed to parse parameter $param\n") unless defined($key);
}

if ($key eq 'binary') {
$self->{_bin} = $val;
} elsif ($key eq 'training') {
$self->{_training_dir} = $val;
} else {
$self->{'_param_'.$key} = $val;
}
}

$self->{_bin} ||= 'genesplicer';
_check_binary($self->{_bin});
_check_training_dir($self->{_training_dir});
return $self;
}

Expand Down