Signatures #75

Open · matthew-brett wants to merge 9 commits into master
Conversation

matthew-brett
Member

An attempt to keep track of whether an image has changed since - for example - it was loaded from disk. Specifically, I want to be able to do something like this:

>>> img = load('an_image.nii')
>>> # some code involving 'img'
>>> if img.maybe_changed():
...           #  maybe save with another filename or something

In order to do this, I need some way of knowing whether anything in the tuple (img.get_data(), img.get_affine(), img.get_header()) has changed.

In order to do this, imagine some function get_state_stamp(). This function returns a "state stamp": hdr_state = get_state_stamp(img.get_header()). Let's say I do something to the image, and I get the header stamp again hdr_state2 = get_state_stamp(img.get_header()). The needed property for the returned state is that hdr_state == hdr_state2 if and only if the header was the same before and after I did something to the image.

What do I mean by "the same" in the above sentence? In most cases the object itself has to decide this. My get_state_stamp() function will check if the object to test (the header in this case) has a state_stamper() method. If it does, then it calls that method so the object can define something unique to its state as the object understands it. If the object does not have such a method, then the function falls back to working on objects it knows about such as dictionaries, lists and so on.

The user of get_state_stamp() is not allowed to depend on any particular value being returned from get_state_stamp(obj), but only the property that get_state_stamp(obj) == get_state_stamp(obj2) if and only if the objects obj, obj2 record themselves as being in same state.

get_state_stamp(obj) can also return something that is never equal to another state (so that, for example, get_state_stamp(obj) != get_state_stamp(obj)) if the object does not implement a state_stamper() method, or does implement such a method but the cost of calculating the state would be too large.
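The dispatch described above can be sketched roughly as follows (a sketch following the description in this thread, not the actual code in nibabel/stampers.py; the names `Unknown` and `state_stamper` come from the discussion):

```python
class Unknown(object):
    """ Sentinel state: compares unequal to everything, itself included. """
    def __eq__(self, other):
        return False
    def __ne__(self, other):
        return True


def get_state_stamp(obj, stamper=None):
    # Objects that know their own state get to define it
    if hasattr(obj, 'state_stamper'):
        return obj.state_stamper(stamper)
    # Fall back to types the function knows how to stamp deterministically
    if isinstance(obj, (type(None), bool, int, float, str, bytes)):
        return (type(obj), obj)
    if isinstance(obj, (list, tuple)):
        return (type(obj), tuple(get_state_stamp(e, stamper) for e in obj))
    if isinstance(obj, dict):
        return (type(obj), tuple(sorted(
            (key, get_state_stamp(value, stamper))
            for key, value in obj.items())))
    # Anything else: state is unknowable, so never report "unchanged"
    return Unknown()
```

Including the type in each stamp means a list and a tuple with the same contents never compare equal, and the `Unknown` fallback guarantees that an unstampable object is never reported as unchanged.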

Note that the returned state is similar to the result from hash(obj) but in this case, obj can be mutable and therefore does not implement __hash__ in the general case. Also hash(obj) is not guaranteed to be unique (as I understand it) even between two states of the same object, or to differ between two entirely different objects: http://stackoverflow.com/questions/9010222/how-can-python-dict-have-multiple-keys-with-same-hash

See also the comments at the top of nibabel/stampers.py in matthew-brett@3523624

""" Return stamp for current state of `self`

The result somewhat uniquely identifies the state of the array proxy.
It assumes that the underly ``self.file_like`` does not get modified.
Member


underlying

Member Author


On Sat, Jan 14, 2012 at 1:42 PM, Satrajit Ghosh
[email protected]
wrote:

@@ -22,6 +56,41 @@ def array(self):
         return self._data

     def _read_data(self):
-        raise NotImplementedError
+        fileobj = allopen(self.file_like)
+        data = self.header.data_from_fileobj(fileobj)
+        if isinstance(self.file_like, basestring):  # filename
+            fileobj.close()
+        return data
+
+    def state_stamper(self, caller):
+        """ Return stamp for current state of self
+
+        The result somewhat uniquely identifies the state of the array proxy.
+        It assumes that the underly self.file_like does not get modified.

underlying

Thanks - fixed

@fperez

fperez commented Jan 20, 2012

@matthew-brett, did you notice this one needs a rebase before it can be merged?

@matthew-brett
Member Author

On Fri, Jan 20, 2012 at 4:24 AM, Fernando Perez
[email protected]
wrote:

@matthew-brett, did you notice this one needs a rebase before it can be merged?

I did, but I was hoping for some more reviews. There didn't seem much
point in rebasing before it's ready to merge.

@fperez

fperez commented Jan 21, 2012

Ah, OK. In IPython, we tend to ask PRs that get out of sync to be rebased first, so they can also be merged locally for testing and review. It's also possible that the rebase changes something important, so there's less chance of reviewing the wrong thing.

Another workflow note: I find that it's best to put in the message at the top of this page (the PR description/intro box) a more detailed description of the intent of the PR; basically I'd suggest pasting the text you sent by email in the request for review right there. It will make the discussion page more self-contained and easier for others to jump into even if they don't have the email handy anymore (archived in gmail, whatever).

This is a pretty complex chunk of code, so I may have missed something: is it fundamentally necessary for the ..stamper methods to arbitrarily return different things? I see that there are several of them implemented in the PR and they each return something different. Further down, in the stampers.py docstring you have a detailed, abstract explanation of the idea, but I'm wondering if this can't be simplified... Perhaps returning a uuid that defines state, so you can simply compare whether it changed or not? I don't know, I'm just trying to think of an approach that might achieve the desired goals in a simpler manner, but I may well have missed understanding a key requirement.

@matthew-brett
Member Author

Hi,

On Sat, Jan 21, 2012 at 12:00 AM, Fernando Perez
[email protected]
wrote:

Ah, OK.  In IPython, we tend to ask PRs that get out of sync to be rebased first, so they can also be merged locally for testing and review.  It's also possible that the rebase changes something important, so there's less chance of reviewing the wrong thing.

I'm not very experienced with these yet, but the PR got out of sync
after the small reviews started. I think it's right to say that
rebasing loses the commit- and line- specific review comments.
Actually, I was wondering how you dealt with that in ipython from your
green-button thread.

Another workflow note: I find that it's best to put in the message at the top a more detailed description of the intent of the PR; basically I'd suggest pasting the text you sent by email in the request for review right there.  It will make the discussion page more self-contained and easier for others to jump into even if they don't have the email handy anymore (archived in gmail, whatever).

OK

This is a pretty complex chunk of code, so I may have missed something: is it fundamentally necessary for the ..stamper methods to arbitrarily return different things?  I see that there are several of them implemented in the PR and they each return something different.  Further down, in the stampers.py docstring you have a detailed, abstract explanation of the idea, but I'm wondering if this can't be simplified... Perhaps returning a uuid that defines state, so you can simply compare whether it changed or not?  I don't know, I'm just trying to think of an approach that might achieve the desired goals in a simpler manner, but I may well have missed understanding a key requirement.

A uuid is fine of course, but the question is how to form a uuid that
is unique to the state of the object and across classes of objects.
The stampers are returning, generally, stamps that are tuples of
things that are relatively small, and relate to the internal state of
the object. For example, a header can be just a record array, but
it can be a record array with 0 or more 'extensions'. The same record
array could be (for example) from a header of class Spm99Analyze or a
header of class AnalyzeHeader. Thus the 'stamp' has to contain
comparable (meaning '==') stuff that identifies the class and the
contents of the record array for those cases,

stamp = (self.__class__, self.binaryblock)

and for headers with 'extensions' (nifti)

stamp = (self.__class__, self.binaryblock) + tuple(e.stamp_state() for
e in self.extensions)

if you see what I mean...
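Concretely, the two cases could be sketched like this (hypothetical minimal classes for illustration; the real nibabel headers and method names differ):

```python
class AnalyzeHeader(object):
    def __init__(self, binaryblock):
        self.binaryblock = binaryblock

    def state_stamper(self, caller=None):
        # Class identity plus raw bytes: headers of different classes
        # never compare equal, even with identical bytes
        return (self.__class__, self.binaryblock)


class NiftiHeader(AnalyzeHeader):
    def __init__(self, binaryblock, extensions=()):
        super().__init__(binaryblock)
        self.extensions = list(extensions)

    def state_stamper(self, caller=None):
        # Extend the basic stamp with the state of each extension
        return ((self.__class__, self.binaryblock) +
                tuple(e.state_stamper(caller) for e in self.extensions))
```

With this scheme, two AnalyzeHeader instances built from the same bytes compare equal, while an AnalyzeHeader and a NiftiHeader with identical bytes do not, because the class is part of the stamp.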

@fperez

fperez commented Jan 21, 2012

Hey,

  • Re. PR rebasing: github seems to be pretty good at saving static copies of the discussion on the page, so that even if the code changes those fragments remain visible. At least I think that's the case, b/c I don't recall having had any problems with PR discussions going incomprehensible after a rebase due to loss of intermediate info. We do it so often that I imagine we'd have seen problems already.
  • uuid: well, my take on it is that by having a variable-return API, this puts the burden of understanding how to interpret the stamp on the caller, and not on the object itself. But it's the object's author who should know better what is meaningful change and what isn't. So it seems to me that the stamper() should return some opaque value that the caller can simply compare with a previous one for equality, simply getting a True/False response. The object itself would be the one tasked with implementing the construction of a hash/uuid for itself.

Basically, it seems to me that this is fundamentally a question of asking objects for their hash. If that's the case, then I think it can be done more simply; if not then I did misunderstand something...

@matthew-brett
Member Author

Hi,

On Sat, Jan 21, 2012 at 8:38 PM, Fernando Perez
[email protected]
wrote:

Hey,

  • Re. PR rebasing: github seems to be pretty good at saving static copies of the discussion on the page, so that even if the code changes those fragments remain visible.  At least I think that's the case, b/c I don't recall having had any problems with PR discussions going incomprehensible after a rebase due to loss of intermediate info.  We do it so often that I imagine we'd have seen problems already.
  • uuid: well, my take on it is that by having a variable-return API, this puts the burden of understanding how to interpret the stamp on the caller, and not on the object itself.  But it's the object's author who should know better what is meaningful change and what isn't.  So it seems to me that the stamper() should return some opaque value that the caller can simply compare with a previous one for equality, simply getting a True/False response.  The object itself would be the one tasked with implementing the construction of a hash/uuid for itself.

Basically, it seems to me that this is fundamentally a question of asking objects for their hash.  If that's the case, then I think it can be done more simply; if not then I did misunderstand something...



I'm not sure I understand either.

A 'state' is just something such that it compares equal only to an
object of the same class in the same state.

Thus a state only needs to support comparison with other states via '=='

Typically, custom classes will return their own states via self.stamp_state()

The 'stamper' object has default stamping algorithms for objects that
don't implement self.stamp_state()

Any object not known to the default stamping algorithms and not
implementing self.stamp_state() gets a state of Unknown(), which
compares unequal to everything, including itself.

Anyone implementing a new object or a subclass of an object with a
self.stamp_state() method will have to work out what to return, such
that it compares equal only to the state for an object of the same
class and in the same state.

Does that make sense?
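As a toy illustration of the checkpoint pattern this enables (a sketch only, not the PR's implementation; the real maybe_changed() stamps data, affine, header and file_map):

```python
class Image(object):
    def __init__(self, data, affine):
        self.data = data        # plain lists standing in for arrays
        self.affine = affine
        self.reset_changed()

    def _current_state(self):
        # Stamp only what we know how to stamp: here, plain sequences
        return (tuple(self.data), tuple(self.affine))

    def reset_changed(self):
        # Checkpoint the state that later comparisons run against
        self._stored_state = self._current_state()

    def maybe_changed(self):
        # If a component's state were Unknown (never equal to anything),
        # this would conservatively report True
        return self._current_state() != self._stored_state


img = Image([1.0, 2.0], [0.0, 1.0])
assert not img.maybe_changed()
img.data[0] = 99.0          # mutate in place
assert img.maybe_changed()
img.reset_changed()         # re-checkpoint
assert not img.maybe_changed()
```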

@cindeem
Contributor

cindeem commented Feb 1, 2012

I was reviewing the code, first ran nosetests, and had multiple errors for .maybe_changed(),
both running nosetests and running the example test in ipython

#######
img_klass = nibabel.spatialimages.SpatialImage
arr = np.arange(5, dtype=np.int16)
aff = np.eye(4)
img = img_klass(arr, aff)
hdr = img.get_header()
img = img_klass(arr, aff, hdr)
assert_false(img.maybe_changed()) # this one passed
aff[0,0] = 1.1
assert_false(img.maybe_changed()) # this one failed

I'm running python 2.7, epd 7.1.2 with branch acfbdabd91479b157ee23d6f852620cb382a460d
nose 1.0.0
numpy 1.6.1

note: this works

arr = np.arange(5, dtype=np.int16)
aff = np.eye(4)
img = img_klass(arr.copy(), aff.copy())
aff[0,0] = 1.1
img.maybe_changed()
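The difference between the two snippets is most likely NumPy aliasing (an inference from the copy() variant above, not something stated in the report): if the image keeps references to the caller's arrays, mutating `aff` in place also changes the image's state. A minimal demonstration of the aliasing itself:

```python
import numpy as np

aff = np.eye(4)
alias = aff          # same underlying buffer
copy = aff.copy()    # independent buffer

aff[0, 0] = 1.1
print(alias[0, 0])   # 1.1 -- the alias sees the mutation
print(copy[0, 0])    # 1.0 -- the copy does not
```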

@GaelVaroquaux
Member

Here are some comments on the pull request. I am not a nibabel contributor, just a user, so this comes a bit from the peanut gallery, but as you asked for review on the nipy mailing list, I thought this might help.

My first comment is that it is a bit difficult for me to get a big picture of this functionality: I haven't seen an example, or a documentation section telling me how I use this. I see the comments at the top of the pull request, but I am not sure exactly what problem you are trying to solve, and how I, as an end user, should be seeing this functionality. Thus my comments may be off target.

I am not going to comment on the technical details of the code: I don't know the nibabel codebase well enough. My general impression is that I am a bit scared off by the pull request. It is a big change: 1Kloc of code, for a library that's 16Kloc, touching a lot of different parts of the code. It introduces a lot of cleverness, with getters, nested object oriented programming and proxies. These are things that I now try to avoid, after having learned from Mayavi that it's quite easy to find oneself the sole maintainer of 'clever' code. One of the drawbacks of having such code in nibabel is that everybody coding in nibabel should then be aware of the desired behavior, so that when they implement new loaders, they do not break it.

I am also wary that it might be quite easy to break such functionality. I haven't looked enough at the details, but is the code resilient to people modifying views of the data? Also, an important point, does relying on the time-stamping mechanism require keeping objects around that would leave file handles open? In functional neuroimaging people still work with thousands of files, and I have found that I can easily get a 'Too many files open' with neuroimaging code that keeps file handles. One way to have a clear answer to this last question is: has somebody stress-tested it in a 'production' setting, solving a decent size neuroimaging problem, such as running a group analysis?

To help judge the added value of the code, do you have some specific applications in mind? I can see that a pipelining framework like nipype might benefit from this functionality. @satra, @chrisfilo, could you comment on whether you would be interested in using these time-stamps in nipype? I guess the question that would be most interesting to a maintainer of nibabel would be: are other people ready to commit to the maintenance of such functionality?

As you might have guessed, my gut feeling is that the ratio of cost (in terms of complexity) to gain is not in favor of adding such code. My instinct tells me YAGNI: "you're not going to need it"; this is outside the 80/20 rule.

To implement a similar functionality, i.e. telling me whether an object has changed or not, I would make sure that it pickles, and implement an md5 hash on its 'reduce'. The reason that I like this approach is that it implies only local changes (pickling is standard Python) and is completely robust to any side effects. Also, it is surprising how little computing an md5 costs compared to other operations:

In [2]: %time img = nibabel.load('Juelich-prob-2mm.nii.gz'); t = (img.get_data(), img.get_affine(), img.get_header())
CPU times: user 0.78 s, sys: 0.40 s, total: 1.18 s
Wall time: 1.28 s

In [3]: %timeit md5 = hashlib.md5(); md5.update(cPickle.dumps((img.get_affine(), img.get_header()), pickle.HIGHEST_PROTOCOL)); md5.digest()
1000 loops, best of 3: 290 us per loop

In [4]: %timeit np.std(img.get_data(), axis=-1)
1 loops, best of 3: 3.56 s per loop

In [5]: %timeit np.sum(img.get_data()**2, axis=-1)
1 loops, best of 3: 857 ms per loop

Note that I have studied the loading time for the 2mm Juelich atlas more carefully (http://gael-varoquaux.info/blog/?p=159) and I can confirm that the time to load from an empty cache is indeed around 1s. The reason I am pointing this out is that, due to disk cache effects, I have found that timing I/O is hard.

For a bigger image, my timings weren't too useful, as the loading line forced my system to swap, and thus were extremely slow:

In [6]: %time img = nibabel.load('Juelich-prob-1mm.nii.gz'); t = (img.get_data(), img.get_affine(), img.get_header())
CPU times: user 5.86 s, sys: 4.41 s, total: 10.27 s
Wall time: 92.61 s

In [7]: %timeit md5 = hashlib.md5(); md5.update(cPickle.dumps((img.get_affine(), img.get_header()), pickle.HIGHEST_PROTOCOL)); md5.digest()
1000 loops, best of 3: 282 us per loop

Such timings tell me that the cost of computing an md5 hash is small in a larger setting, and thus I think that I can solve efficiently most of my needs for a stamp of data using an md5.
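The approach above fits in a few lines of stdlib code; a sketch (`state_digest` is a hypothetical helper name, not part of nibabel):

```python
import hashlib
import pickle


def state_digest(*parts):
    """ Opaque fingerprint of picklable objects.

    Equal digests (almost certainly) mean equal state; different
    digests mean the state has changed.
    """
    md5 = hashlib.md5()
    md5.update(pickle.dumps(parts, pickle.HIGHEST_PROTOCOL))
    return md5.hexdigest()


# Usage against an image would then be something like:
# before = state_digest(img.get_affine(), img.get_header())
# ... code involving img ...
# changed = state_digest(img.get_affine(), img.get_header()) != before
```

The caller only ever compares two digests for equality, which is exactly the opaque-value API argued for earlier in the thread; the cost is that everything stamped this way must pickle deterministically.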

It seems to me that 90% of nibabel use cases are just reading, concatenating and writing images with data and affine. The fact that nibabel does this for many file formats efficiently is a huge benefit to the nipy community. I had the impression that the vision is to support more and more formats, and it's paying off (I am thinking of the MGH or MEG loader, which none of the original nipy or nibabel team would have envisaged). If it's actually the case, you don't want people volunteering to support new formats to need too much understanding. Hence this is why I lean toward a minimal design.

@agramfort
Contributor

90% of nibabel usecases is just read, concat and write images with data and affine.
Doing this for many file formats efficiently is the first benefit of nibabel whose vision
is to support more and more formats. If it's actually the case you don't want people
volunteering to support new formats to understand too much hence have a minimal
design.

my 2 cents...

@GaelVaroquaux
Member

It may feel awkward that Alex is pretty much repeating the last paragraph
of my comment. I should point out that we discussed my review a bit by
email, which explains the similarity :)

@matthew-brett
Member Author

Hi,

On Thu, Feb 2, 2012 at 9:03 AM, Alexandre Gramfort
[email protected]
wrote:

90% of nibabel usecases is just read, concat and write images with data and affine.
Doing this for many file formats efficiently is the first benefit of nibabel whose vision
is to support more and more formats. If it's actually the case you don't want people
volunteering to support new formats to understand too much hence have a minimal
design.

Well - the design of signatures requires that you know what you are
doing, when you subclass a class for which a signature is defined.
So, if you subclass Analyze, and you add something that you think the
signature should keep track of, you need to write a signature method
for it. For other classes the signature defaults to Unknown and
therefore you don't need to worry about it.

@matthew-brett
Member Author

This is a re-post of my reply to @GaelVaroquaux comments that seems to have been lost by the github email system.

My first comment is that it is a bit difficult for me to get a big picture of this functionality: I haven't seen an example, or
a documentation section telling me how I use this. I see the comments at the top of the pull request, but I am not sure
exactly what problem you are trying to solve, and how I, as an end user, should be seeing this functionality. Thus my
comments may be off target.

The problem I am trying to solve is:

>>> img = load('an_image.nii')
>>> # some code involving 'img'
>>> if img.maybe_changed():
...           #  maybe save with another filename or something

I am not going to comment on the technical details of the code: I don't know the nibabel codebase well enough. My
general impression is that I am a bit scared off by the pull request. It is a big change: 1Kloc of code, for a library that's
16Kloc, touching a lot of different parts of the code. It introduces a lot of cleverness, with getters, nested object
oriented programming and proxies.

I am not quite sure what you mean by 'getters' but the proxies and nested classes are not central to this pull request (there are some edits to them, but they were there before). I guess you might have read the code too quickly to disentangle these from the main thrust of the code, which is rather simple, in my opinion.

One of the drawbacks of having such code in nibabel is that everybody coding in nibabel should then be aware of the
desired behavior, so that when they implement new loaders, they do not break it.

Yes, that's a worry, but only for already-defined classes. If someone
implementing a new class doesn't want to define these, the default
behavior for img.maybe_changed() is to return Unknown()

I am also wary that it might be quite easy to break such functionality. I haven't looked enough at the details, but is the
code resilient to people modifying views of the data?

Yes, it should be, and the tests test this.

Also, an important point, does relying on the time-stamping mechanism require keeping objects around that would leave
file handles open?

No - I don't think so. If you created the image with filehandles then
there will be an extra reference to these filehandles within the
object (unless you store the state_stamp outside the image for some
reason). If you created the image with filenames, these stay as
filenames.

One way to have a clear answer to this last question is: has somebody stress-tested it in a 'production' setting, solving
a decent size neuroimaging problem, such as running a group analysis?

No - because I wanted some feedback on the design before releasing out
into the world for stress testing.

To help judging the added value of the code, do you have some specific applications in mind? I can see that a pipelining
framework like nipype might benefit from this functionality. @satra, @chrisfilo, could you comment on whether you
would be interested in using these time-stamps in nipype? I guess the question that would be most interesting to a
maintainer of nibabel would be: are other people ready to commit to the maintenance of such functionality?

Hum - there's a medium amount of maintenance to do, in my view.
Personally I find the code rather simple and it was fast to write, but
I think it would need some eyes on it (the actual code) to get a feel for whether my
feeling is reasonable.

As you might have guessed, my gut feeling is that the ratio of cost (in terms of complexity) to gain is not in favor of
adding such code. My instinct tells me YAGNI: "you're not going to need it"; this is outside the 80/20 rule.

Well - I implemented it because the nipypers seemed to think it was
important. I agree, if that isn't the case, it may well not be worth
it. @satra @chrisfilo - if you guys are interested in this, speak up.

To implement a similar functionality, i.e. telling me whether an object has changed or not, I would make sure that it
pickles, and implement an md5 hash on its 'reduce'. The reason that I like this approach is that it implies only local
changes (pickling is standard Python) and is completely robust to any side effects.

There are two problems with that. The first is that I wanted to avoid
a forced md5 check on loading large arrays as you are implying. The
second is that things like header extensions can change internally
while in fact having the same meaning. I mean, just accessing the
extension could read the fields out into the object, but not change
any values. Thus it would look as if it had changed when it had not.
It seemed more flexible to allow the object itself to specify when it
had changed (and default to "don't know" if it didn't have an idea of
how to check that).

But - I would be happy to be corrected if you think these problems can
in fact be avoided by md5 hashes on pickles.

@chrisgorgo
Member

Hi,
very interesting functionality. Kudos for implementing this. So in nipype at the moment all image-related operations are file based. A Node reads a file, does its magic and saves a file. As you can imagine, I/O is one of the most time consuming components.

However it would be cool to have an ultra fast in-memory mode. For simplification let's assume that all nodes would be pythonic. Nodes, instead of exchanging filenames, could be passing an image object in memory. This is where this new nibabel functionality could come in handy. It will take a lot of effort to reach this functionality and we don't plan to implement it in the near future. There are many issues we would have to think about (for example, what about non-pythonic interfaces? how would that play along with parallel processing?).

So it looks potentially useful, and I really appreciate you thinking about nipype, but at the moment I cannot promise when we will attempt to make use of this.

@matthew-brett
Member Author

@cindeem :

Thanks for running the tests ! Sorry to be slow to reply - jet-lag and old age.

Yes, sorry, I seem to have got confused with commits on two branches, so I needed a fix from master to make the tests pass. I've rebased on master. The tests now pass for me. Do they for you?

@matthew-brett
Member Author

The discussion here seems to have stalled, so I will summarize in the hope that it will help us return to the problem.

@satra did some helpful code review and found a problem with dictionaries which I believe is fixed

@fperez asked me to clarify what I was trying to achieve. This made me expand the summary at the top of the pull request, and investigate python hashing in more detail. It turned out that hashing isn't designed to give a unique id for a particular python object, but only something that is fairly unique, such that it helps dictionary lookups. Thus two objects can have the same hash.

@cindeem kindly tested the code and found it was broken. I think that is fixed now.

@GaelVaroquaux wondered whether the mechanism might be too complicated and easy to break, and whether it was important to do this anyway. I partly agreed, but thought that a) the code probably was needed and b) it touches some complicated code, but the added code I thought was rather simple.

@agramfort agreed with Gael and wondered if it would make maintenance harder. I thought it would, somewhat, but would not make it harder to add new image types.

@chrisfilo spoke up for nipype saying he thought the functionality was potentially useful, but probably not in the short term.

So - I think we now have the situation where the need for this is moderate, and the cost is also moderate.

If we could achieve lower complexity for the same goal, this would be ideal.

Would any of y'all consider having a look at the code to see how this could be achieved?

The commit messages (some truncated in the page):

  • State stamps are values that define the state of an object, such that it will compare equal if the state is the same.
  • To allow comparison between them
  • Basic state stamp for headers. Any extensions
  • ArrayProxy was just a stub vaguely indicating the API. Fill out the stub with the Analyze implementation, and add state stamping. Add tests.
  • The arrayproxy module now implements an arrayproxy which works for both analyze and MGH format.
  • Image state defined by data, affine, header and file_map
  • Remove _load_cache and replace with _stored_state.
  • Make public maybe_changed method and reset_changed method to checkpoint state and do limited comparisons of state. Thanks to Satra for spotting this one.
  • Some extension to tests
@matthew-brett
Member Author

A note to say that I am leaning toward Gael's pickle idea. I hope to replace this pull request with one based on pickle hashes.

@larsoner
Contributor

I wrote some general code for mne-python that implements a __hash__ for our objects, but is actually quite general. I'd be happy to port it over here if you want. It only takes ~20 lines of code:

https://github.com/mne-tools/mne-python/blob/master/mne/utils.py#L78

Here's an example use for one of our classes (note that self.info is a dict with a ton of different entries like ndarray, lists of strings, other dicts, etc.):

https://github.com/mne-tools/mne-python/blob/master/mne/epochs.py#L360
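The linked helper is essentially a recursive hash over nested containers and arrays; a rough sketch of the idea (my reconstruction, not the actual mne-python code):

```python
import hashlib

import numpy as np


def object_hash(x, h=None):
    """ Fold an arbitrarily nested object into a running md5 hash. """
    if h is None:
        h = hashlib.md5()
    if isinstance(x, dict):
        # Sort keys so dicts hash independently of insertion order
        for key in sorted(x):
            h.update(str(key).encode())
            object_hash(x[key], h)
    elif isinstance(x, (list, tuple)):
        for item in x:
            object_hash(item, h)
    elif isinstance(x, np.ndarray):
        # Hash the raw buffer rather than repr, which may truncate
        h.update(x.tobytes())
    else:
        h.update(str(x).encode())
    return int(h.hexdigest(), 16)
```

Two structurally equal objects then hash equal, and mutating any nested array or value changes the hash, which is exactly the property this PR's state stamps are after (minus the class-identity and Unknown handling discussed above).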

8 participants