Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable overflow for accu keys, to_analysis_urls, and submission_cmd_args #40

Open
wants to merge 5 commits into
base: version/2.2
Choose a base branch
from

Conversation

ens-bwalts
Copy link
Contributor

No description provided.

@coveralls
Copy link

coveralls commented Jul 26, 2018

Coverage Status

Coverage increased (+3.7%) to 54.762% when pulling 8beee26 on experimental/overflow_fields into 121b02e on version/2.2.

@ens-bwalts ens-bwalts requested a review from muffato July 26, 2018 13:03
Copy link
Contributor

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, very good effort to push the fix back to 1.9 ! It's brave of you :)

I like how transparent the change would be for the users / pipelines. Can you add analysis_ctrl_rule.condition_analysis_url to the list of overflow-able fields ?

Finally, would you be able to add a test ?

sub overflow_limit {
return {
'key_signature' => 255,
'struct_name' => 255,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about overflowing struct_name. I see it as a static identifier / parameter name, like logic_name, pipeline_wide_parameters.param_name, etc. I also think they should be protected, but perhaps differently. Maybe introduce a size_limit method in the adaptors that define a maximum allowed size for some fields, and the base store method checks the values before doing the INSERT ? It might be possible to get those limits automatically through DBSQL::BaseAdaptor::_table_info_loader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct_name needs protection because it's linked to to_analysis_url, which is itself protected. As for param_name, logic_name, and others that get set at init time - I'm not sure it's worth introducing a size limit which would throw an error, since the MySQL error that gets thrown is actually informative in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually depends how the MySQL server has been configured. mysql-ens-compara-prod-1 complains:

DBD::mysql::st execute failed: Data too long for column 'logic_name' at row 1 at /home/matthieu/workspace/src/hive/2.5/modules/Bio/EnsEMBL/Hive/DBSQL/StatementHandle.pm line 140.

whereas mysql-ens-reg-prod-4 silently truncates the logic name to 256 characters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove struct_name ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have been experimenting (not pushed yet) with getting sizes in _table_info_loader(). Straightforward, but for many types COLUMN_SIZE returns null. These are types such as INT or TEXT, where either the size is very large, or that we are less concerned about. The bottom line being that the check against size_limit is not comprehensive without a lot of work (and somehow defining it for every column elsewhere in the code), but it does cover the cases that the ticket and this PR care about



} @$fields ];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you override slicer here because AccumulatorAdaptor is the only NakedTableAdaptor that has overflowing and you wanted to keep NakedTableAdaptor's slicer simple and straightforward ?

It just felt surprising at the first glance. Maybe you can add a comment on top of NakedTableAdaptor::slicer indicating that should we need to move the overflow capability up, we can find it in AccumulatorAdaptor ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I didn't want to impose the overhead on every inheriting Adaptor. However, the overhead is small, and for maintainability it's probably better to move it to the base class

@ens-bwalts ens-bwalts changed the base branch from version/1.9 to version/2.2 January 21, 2019 15:04
Copy link
Contributor

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: the adaptors should implement default_overflow_limit, not overflow_limit

t/02.api/overflow.t Outdated Show resolved Hide resolved
return [ @$hashref{@$fields} ];
my $overflow_limit = $self->overflow_limit();

return [ map { eval { my $value = $hashref->{$_};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there is an eval in ObjectAdaptor too, but do we really need them ? I am concerned they could hide some errors

my $overflow_limit = $self->overflow_limit();

return [ map { eval { my $value = $hashref->{$_};
my $ol = $overflow_limit->{$_};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost exactly the same as in ObjectAdaptor. I would prefer this to be in BaseAdaptor but I'm not sure how to do it cleanly, clearly and efficiently.

sub overflow_limit {
return {
'key_signature' => 255,
'struct_name' => 255,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove struct_name ?

@ens-bwalts ens-bwalts force-pushed the experimental/overflow_fields branch from 7bd2fdc to 8beee26 Compare April 3, 2019 12:48
@mkszuba mkszuba requested a review from mira13 April 5, 2019 12:46
Copy link
Contributor

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments across:

  1. Adaptors should implement default_overflow_limit, not overflow_limit
  2. Since we now get the maximum size from the database engine, the interesting part of the overflow-limit hash is really just the set of keys, since the values should be the same as the column size. Maybe on version/* branches, this should remain as it is in order to minimize the changes, but on master, what about replacing overflow_limit with overflow_list which would be the list of the columns allowed to overflow ? BaseAdaptor would read their maximum size from size_limit

On a completely separate note, what is your policy about column width ? There is a line in overflow.t with 161 characters, but line 628 in BaseAdaptor has been split (it would have been 118 characters otherwise). I find that the line-returns at lines 620 and 628 in BaseAdaptor break the flow

}
}

sub slicer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this slicer. It's got all the functionality we want, and it avoids duplicating code and logic.
In principle, though, this method in BaseAdaptor should not know about sub-clases (NakedTable vs Object) and how to treat them differently ($is_object). A purist would ask to introduce a method $self->get_field_value($sliceable, $field) that would be implemented as a hash lookup in NakedTableAdaptor and a method call in ObjectAdaptor. I don't know yet if I want this or not

# the way slicer does column size checking. Likewise, PostgreSQL reports
# a user-defined TYPE_NAME for enums, rather than 'enum'. Therefore, if
# the DB is PostgreSQL, only set size_limit for varchars
unless (($driver eq 'pgsql') && !($column_type eq 'character varying')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a matter of taste, but I find an if would be way clearer

if (($driver ne 'pgsql') || ($column_type eq 'character varying')) {

@@ -138,6 +139,26 @@ sub overflow_limit {
return $self->{_overflow_limit} || $self->default_overflow_limit();
}

=head2 size_limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paradigm of BaseAdaptor is to have a default_ version of all those methods, that is only a getter, and then the normal version which is a getter/setter and falls back to the default_ version to set the initial value

# a user-defined TYPE_NAME for enums, rather than 'enum'. Therefore, if
# the DB is PostgreSQL, only set size_limit for varchars
unless (($driver eq 'pgsql') && !($column_type eq 'character varying')) {
$size_limit{$column_name} = $size_limit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will set a size limit for all column types, incl. integers, enums and timestamps, but I'm not sure these are correctly enforceable:

  • For integers, we get the size of the longest possible string that fits in. For instance, for the standard INT (32 bits) which ranges from -2^32 to 2^32-1, that's 11 characters, but in fact, not all 11-digit integers can fit there.
  • ENUMs are given the size of their longest element
  • TIMESTAMPs are given 14 characters, which I find scary. My text editor tells me that 2019-04-05 15:47:14 has 19 characters, so the API would start throwing when storing timestamps ?? Although these columns are most often set to CURRENT_TIMESTAMP in a way that bypasses slicer, I fear a corner case whereby eHive stores a previously-fetched object that already has a timestamp and the adaptor would now fail.

I would suggest to only populate size_limit for text fields.

@@ -48,6 +48,13 @@ sub default_table_name {
}


sub overflow_limit {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd to me how many times you overload the overflow_limit sub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants