Signatures #75
nibabel/arrayproxy.py
Outdated
""" Return stamp for current state of `self` | ||

The result somewhat uniquely identifies the state of the array proxy.
It assumes that the underly ``self.file_like`` does not get modified.
underlying
On Sat, Jan 14, 2012 at 1:42 PM, Satrajit Ghosh wrote:
@@ -22,6 +56,41 @@ def array(self):
return self._data
def _read_data(self):
- raise NotImplementedError
- fileobj = allopen(self.file_like)
- data = self.header.data_from_fileobj(fileobj)
- if isinstance(self.file_like, basestring): # filename
- fileobj.close()
- return data
- def state_stamper(self, caller):
- """ Return stamp for current state of `self`
- The result somewhat uniquely identifies the state of the array proxy.
- It assumes that the underly ``self.file_like`` does not get modified.
underlying
Thanks - fixed
@matthew-brett, did you notice this one needs a rebase before it can be merged?
On Fri, Jan 20, 2012 at 4:24 AM, Fernando Perez
I did, but I was hoping for some more reviews. There didn't seem much
Ah, OK. In IPython, we tend to ask PRs that get out of sync to be rebased first, so they can also be merged locally for testing and review. It's also possible that the rebase changes something important, so there's less chance of reviewing the wrong thing.

Another workflow note: I find that it's best to put in the message at the top of this page (the PR description/intro box) a more detailed description of the intent of the PR; basically I'd suggest pasting the text you sent by email in the request for review right there. It will make the discussion page more self-contained and easier for others to jump into even if they don't have the email handy anymore (archived in gmail, whatever).

This is a pretty complex chunk of code, so I may have missed something: is it fundamentally necessary for the
Hi, On Sat, Jan 21, 2012 at 12:00 AM, Fernando Perez
I'm not very experienced with these yet, but the PR got out of sync
OK
A uuid is fine of course, but the question is how to form a uuid that
stamp = (self.__class__, self.binaryblock)
and for headers with 'extensions' (nifti)
stamp = (self.__class__, self.binaryblock) + tuple(e.stamp_state() for
if you see what I mean...
Hey,
Basically, it seems to me that this is fundamentally a question of asking objects for their hash. If that's the case, then I think it can be done more simply; if not then I did misunderstand something...
Hi, On Sat, Jan 21, 2012 at 8:38 PM, Fernando Perez
I'm not sure I understand either.
A 'state' is just something such that it compares equal only to an
Thus a state must only be able to be compared with other states with '=='
Typically, custom classes will return their own states via self.stamp_state()
The 'stamper' object has default stamping algorithms for objects that
Any object not known to the default stamping algorithms and not
Anyone implementing a new object or a subclass of an object with a
Does that make sense?
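The idea of classes returning their own states via `stamp_state()`, combined with the `(class, binaryblock) + extension states` tuple mentioned earlier in the thread, could be sketched roughly as below. This is only an illustration under assumptions, not the code in the pull request: the `Header` and `Extension` classes and their constructors are hypothetical toy names, and only the `stamp_state()` method name and the tuple layout come from the discussion.

```python
class Extension(object):
    """Hypothetical header extension holding some raw bytes."""

    def __init__(self, content):
        self.content = content

    def stamp_state(self):
        # A state is any value that supports '==' with other states.
        return (self.__class__, self.content)


class Header(object):
    """Hypothetical header with a binary block and optional extensions."""

    def __init__(self, binaryblock, extensions=()):
        self.binaryblock = binaryblock
        self.extensions = list(extensions)

    def stamp_state(self):
        # Combine the class, the raw header bytes and each extension's state.
        return ((self.__class__, self.binaryblock)
                + tuple(e.stamp_state() for e in self.extensions))


hdr = Header(b"\x00" * 8, [Extension(b"comment")])
before = hdr.stamp_state()
hdr.extensions.append(Extension(b"another"))
after = hdr.stamp_state()
print(before == after)  # False: the header gained an extension
```

Including the class in the tuple means that a subclass with the same bytes still produces a different state, which seems to be the intent of the tuples quoted above.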
I was reviewing the code. I first ran nosetests and got multiple errors for .maybe_changed(). I'm running python2.7, epd7.1.2 with branch acfbdabd91479b157ee23d6f852620cb382a460d. Note: this works: arr = np.arange(5, dtype=np.int16)
Here are some comments on the pull request. I am not a nibabel contributor, just a user, so this comes a bit from the peanut gallery, but as you asked for review on the nipy mailing list, I thought this might help.

My first comment is that it is a bit difficult for me to get a big picture of this functionality: I haven't seen an example, or a documentation section telling me how I use this. I see the comments at the top of the pull request, but I am not sure exactly what problem you are trying to solve, and how I, as an end user, should be seeing this functionality. Thus my comments may be off target. I am not going to comment on the technical details of the code: I don't know the nibabel codebase well enough.

My general impression is that I am a bit scared off by the pull request. It is a big change: 1Kloc of code, for a library that's 16Kloc, touching a lot of different parts of the code. It introduces a lot of cleverness, with getters, nested object oriented programming and proxies. These are things that I now try to avoid, after having learned from Mayavi that it's quite easy to find oneself the sole maintainer of 'clever' code. One of the drawbacks of having such code in nibabel is that everybody coding in nibabel should then be aware of the desired behavior, so that when implementing new loaders, they do not break it.

I am also wary that it might be quite easy to break such functionality. I haven't looked enough at details, but is the code resilient to people modifying views of the data? Also, an important point: does relying on the time-stamping mechanism require keeping objects around that would leave file handles open? In functional neuroimaging people still work with thousands of files, and I have found that I can easily get a 'Too many files open' with neuroimaging code that keeps file handles. One way to have a clear answer to this last question is: has somebody stress-tested it in a 'production' setting, solving a decent size neuroimaging problem, such as running a group analysis?

To help judge the added value of the code, do you have some specific applications in mind? I can see that a pipelining framework like nipype might benefit from this functionality. @satra, @chrisfilo, could you comment on whether you would be interested in using these time-stamps in nipype? I guess the question that I would be most interested in, as a maintainer of nibabel, would be: are other people ready to commit to the maintenance of such functionality?

As you might have guessed, my gut feeling is that the ratio of cost (in terms of complexity) to gain is not in favor of adding such code. My instinct tells me YAGNI: "you're not going to need it"; this is outside the 80/20 rule.

To implement a similar functionality, i.e. telling me whether an object has changed or not, I would make sure that it pickles, and implement an md5 hash on its 'reduce'. The reason that I like this approach is that it implies only local changes (pickling is standard Python) and is completely robust to any side effects.
Also, it is surprising how little computing an md5 costs compared to other operations:

    In [2]: %time img = nibabel.load('Juelich-prob-2mm.nii.gz'); t = (img.get_data(), img.get_affine(), img.get_header())
    CPU times: user 0.78 s, sys: 0.40 s, total: 1.18 s
    Wall time: 1.28 s

    In [3]: %timeit md5 = hashlib.md5(); md5.update(cPickle.dumps((img.get_affine(), img.get_header()), pickle.HIGHEST_PROTOCOL)); md5.digest()
    1000 loops, best of 3: 290 us per loop

    In [4]: %timeit np.std(img.get_data(), axis=-1)
    1 loops, best of 3: 3.56 s per loop

    In [5]: %timeit np.sum(img.get_data()**2, axis=-1)
    1 loops, best of 3: 857 ms per loop

Note that I have studied more accurately loading time for the 2mm Juelich atlas (http://gael-varoquaux.info/blog/?p=159) and I can confirm that the time to load from an empty cache is indeed around 1s. The reason that I am pointing this out is that due to disk cache effect, I have found that timing I/O is hard. For a bigger image, my timings weren't too useful, as the loading line forced my system to swap, and thus were extremely slow:

    In [6]: %time img = nibabel.load('Juelich-prob-1mm.nii.gz'); t = (img.get_data(), img.get_affine(), img.get_header())
    CPU times: user 5.86 s, sys: 4.41 s, total: 10.27 s
    Wall time: 92.61 s

    In [7]: %timeit md5 = hashlib.md5(); md5.update(cPickle.dumps((img.get_affine(), img.get_header()), pickle.HIGHEST_PROTOCOL)); md5.digest()
    1000 loops, best of 3: 282 us per loop

Such timings tell me that the cost of computing an md5 hash is small in a larger setting, and thus I think that I can solve efficiently most of my needs for a stamp of data using an md5.

It seems to me that 90% of nibabel usecases is just read, concat and write images with data and affine. The fact that nibabel does this for many file formats efficiently is a huge benefit to the nipy community.
I had the impression that the vision is to support more and more formats, and it's paying off (I am thinking of the MGH or MEG loader, which none of the original nipy or nibabel team would have envisaged). If it's actually the case, you don't want people volunteering to support new formats to need too much understanding. Hence this is why I lean toward a minimal design.
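The pickle-plus-md5 idea from this comment can be sketched as follows. This is a minimal illustration, not code from nibabel: `pickle_md5` is a hypothetical helper name, plain lists and dicts stand in for the affine and header, and the sketch uses the Python 3 `pickle` module rather than the `cPickle` shown in the timings.

```python
import hashlib
import pickle


def pickle_md5(obj):
    """Return the md5 hex digest of the pickled form of `obj`."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    return hashlib.md5(payload).hexdigest()


# Stand-ins for img.get_affine() and img.get_header() (assumed toy values).
affine = [[1.0, 0.0], [0.0, 1.0]]
header = {"descrip": "example", "dim": (64, 64, 30)}

stamp_before = pickle_md5((affine, header))
affine[0][1] = 2.0  # modify the affine in place
stamp_after = pickle_md5((affine, header))

print(stamp_before != stamp_after)  # True: the change is detected
```

One caveat with this approach: pickle output is not guaranteed to be byte-identical for two objects that merely compare equal, so a differing digest is best read as "maybe changed" rather than "definitely changed".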
90% of nibabel usecases is just read, concat and write images with data and affine. my 2 cents...
It may feel awkward that Alex is pretty much repeating the last paragraph
Hi, On Thu, Feb 2, 2012 at 9:03 AM, Alexandre Gramfort
Well - the design of signatures requires that you know what you are
This is a re-post of my reply to @GaelVaroquaux comments that seems to have been lost by the github email system.
The problem I am trying to solve is:

    >>> img = load('an_image.nii')
    >>> # some code involving 'img'
    >>> if img.maybe_changed():
    ...     # maybe save with another filename or something
I am not quite sure what you mean by 'getters' but the proxies and nested classes are not central to this pull request (there are some edits to them, but they were there before). I guess you might have read the code too quickly to disentangle these from the main thrust of the code, which is rather simple, in my opinion.
Yes, that's a worry, but only for already-defined classes. If someone
Yes, it should be, and the tests test this.
No - I don't think so. If you created the image with filehandles then
No - because I wanted some feedback on the design before releasing out
Hum - there's a medium amount of maintenance to do, in my view.
Well - I implemented it because the nipypers seemed to think it was
There are two problems with that. The first is that I wanted to avoid
But - I would be happy to be corrected if you think these problems can
Hi,
However it would be cool to have an ultra fast in-memory mode. For simplification let's assume that all nodes would be pythonic. Nodes, instead of exchanging filenames, could be passing an image object in memory. This is where this new nibabel functionality could come in handy. It will take a lot of effort to reach this functionality and we don't plan to implement this in the near future. There are many issues we would have to think about (for example, what about non-pythonic interfaces? How would that play along with parallel processing?). So it looks potentially useful, and I really appreciate that you are thinking about nipype, but at the moment I cannot promise when we will attempt to make use of this.
@cindeem: Thanks for running the tests! Sorry to be slow to reply - jet-lag and old age. Yes, sorry, I seem to have got confused with commits on two branches, so I needed a fix from master to make the tests pass. I've rebased on master. The tests now pass for me. Do they for you?
The discussion here seems to have stalled, so I will summarize in the hope that it will help us return to the problem.

@satra did some helpful code review and found a problem with dictionaries which I believe is fixed.

@fperez asked me to clarify what I was trying to achieve. This made me expand the summary at the top of the pull request, and investigate python hashing in more detail. It turned out that hashing isn't designed to give a unique id for a particular python object, but only something that is fairly unique, such that it helps dictionary lookups. Thus two objects can have the same hash.

@cindeem kindly tested the code and found it was broken. I think that is fixed now.

@GaelVaroquaux wondered whether the mechanism might be too complicated and easy to break, and whether it was important to do this anyway. I partly agreed, but thought that a) the code probably was needed and b) it touches some complicated code, but the added code I thought was rather simple.

@agramfort agreed with Gael and wondered if it would make maintenance harder. I thought it would, somewhat, but would not make it harder to add new image types.

@chrisfilo spoke up for nipype saying he thought the functionality was potentially useful, but probably not in the short term.

So - I think we now have the situation where the need for this is moderate, and the cost is also moderate. If we could achieve lower complexity for the same goal, this would be ideal. Would any of y'all consider having a look at the code to see how this could be achieved?
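The point about hashing in the summary above (two different objects can share a hash, and the hash only helps dictionary lookups) can be seen with a small toy class. The `Point` class here is purely hypothetical and exists only to force a collision:

```python
class Point(object):
    """Toy class whose hash deliberately ignores the `y` coordinate."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Only `x` contributes, so points sharing `x` collide.
        return hash(self.x)


a, b = Point(1, 2), Point(1, 3)
print(hash(a) == hash(b))  # True: same hash
print(a == b)              # False: different objects
lookup = {a: "first", b: "second"}
print(len(lookup))         # 2: the dict falls back on '==' to tell them apart
```

So a hash alone cannot serve as a state stamp with the "equal if and only if same state" property; an equality comparison is still needed.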
State stamps are values that define the state of an object, such that it will compare equal if the state is the same.
To allow comparison between them
Basic state stamp for headers. Any extensions
ArrayProxy was just a stub vaguely indicating the API. Fill out the stub with the Analyze implementation, and add state stamping. Add tests.
The arrayproxy module now implements an arrayproxy which works for both analyze and MGH format.
Image state defined by data, affine, header and file_map
Remove _load_cache and replace with _stored_state. Make public maybe_changed method and reset_changed method to checkpoint state and do limited comparisons of state.
Thanks to Satra for spotting this one. Some extension to tests
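The checkpointing behaviour named in the commit messages above (a stored state plus public `maybe_changed` and `reset_changed` methods) might look roughly like this. It is a sketch under assumptions rather than the nibabel implementation: `TrackedImage` is a hypothetical class, deep copies of plain Python containers stand in for real state stamps, and only the names `_stored_state`, `maybe_changed` and `reset_changed` come from the commits.

```python
import copy


class TrackedImage(object):
    """Hypothetical image-like object that can checkpoint its own state."""

    def __init__(self, data, affine, header):
        self.data = data
        self.affine = affine
        self.header = header
        self.reset_changed()

    def _current_state(self):
        # Deep copies stand in for real state stamps in this sketch.
        return copy.deepcopy((self.data, self.affine, self.header))

    def reset_changed(self):
        """Checkpoint the current state."""
        self._stored_state = self._current_state()

    def maybe_changed(self):
        """Return True if the state may differ from the checkpoint."""
        return self._current_state() != self._stored_state


img = TrackedImage([[0, 1], [2, 3]], [[1, 0], [0, 1]], {"descrip": "demo"})
print(img.maybe_changed())  # False: nothing touched yet
img.data[0][0] = 99
print(img.maybe_changed())  # True: data modified in place
img.reset_changed()
print(img.maybe_changed())  # False: new checkpoint taken
```

With numpy arrays the plain `!=` on tuples would not work as written, which is one reason a dedicated stamp (or a hash of the pickled bytes) is attractive for real image data.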
A note to say that I am leaning toward Gael's pickle idea. I hope to replace this pull request with one based on pickle hashes.
I wrote some general code for https://github.com/mne-tools/mne-python/blob/master/mne/utils.py#L78 Here's an example use for one of our classes (note that https://github.com/mne-tools/mne-python/blob/master/mne/epochs.py#L360
An attempt to keep some track of whether an image has changed since - for example - it was loaded from disk. Specifically, I want to be able to do something like this:
In order to do this, I need some way of knowing whether any of the tuple (img.get_data(), img.get_affine(), img.get_header()) have changed.
In order to do this, imagine some function `get_state_stamp()`. This function returns a "state stamp": `hdr_state = get_state_stamp(img.get_header())`. Let's say I do something to the image, and I get the header stamp again: `hdr_state2 = get_state_stamp(img.get_header())`. The needed property for the returned state is that `hdr_state == hdr_state2` if and only if the header was the same before and after I did something to the image.

What do I mean by "the same" in the above sentence? In most cases the object itself has to decide this. My `get_state_stamp()` function will check if the object to test (the header in this case) has a `state_stamper()` method. If it does, then it calls that method so the object can define something unique to its state as the object understands it. If the object does not have such a method, then the function falls back to working on objects it knows about such as dictionaries, lists and so on.

The user of `get_state_stamp()` is not allowed to depend on any particular value being returned from `get_state_stamp(obj)`, but only the property that `get_state_stamp(obj) == get_state_stamp(obj2)` if and only if the objects `obj, obj2` record themselves as being in the same state. `get_state_stamp(obj)` can also return something that is never equal to another state (for example `get_state_stamp(obj) != get_state_stamp(obj)`) if the object does not implement a `state_stamper()` method or does implement such a method but the cost of calculating state would be too large.

Note that the returned state is similar to the result from `hash(obj)` but in this case, `obj` can be mutable and therefore does not implement `__hash__` in the general case. Also `hash(obj)` is not guaranteed to be unique (as I understand it) even between two states of the same object, or to differ between two entirely different objects: http://stackoverflow.com/questions/9010222/how-can-python-dict-have-multiple-keys-with-same-hash

See also the comments at the top of nibabel/stampers.py in matthew-brett@3523624
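A rough sketch of the `get_state_stamp()` behaviour described above is given below. It is an illustration under assumptions, not the code in nibabel/stampers.py: the `NeverEqual` sentinel and `ToyHeader` class are hypothetical, the set of "known" fallback types is arbitrary, and passing the function itself as the `caller` argument of `state_stamper(self, caller)` is a guess at how that parameter is meant to be used.

```python
class NeverEqual(object):
    """Sentinel state that compares unequal to everything."""

    def __eq__(self, other):
        return False

    def __ne__(self, other):
        return True

    __hash__ = None  # states are compared, never hashed


def get_state_stamp(obj):
    """Return a value that equals another stamp only for the same state."""
    if hasattr(obj, "state_stamper"):
        # The object defines its own notion of state.
        return obj.state_stamper(get_state_stamp)
    if isinstance(obj, (int, float, str, bytes, type(None))):
        return (type(obj), obj)
    if isinstance(obj, (list, tuple)):
        return (type(obj), tuple(get_state_stamp(item) for item in obj))
    if isinstance(obj, dict):
        return (dict, dict((key, get_state_stamp(value))
                           for key, value in obj.items()))
    # Unknown object: a fresh sentinel per call, so two stamps never match.
    return NeverEqual()


class ToyHeader(object):
    """Hypothetical header that reports its own state."""

    def __init__(self, fields):
        self.fields = fields

    def state_stamper(self, caller):
        return (self.__class__, caller(self.fields))


hdr = ToyHeader({"dim": [64, 64, 30]})
hdr_state = get_state_stamp(hdr)
hdr.fields["dim"][2] = 31
hdr_state2 = get_state_stamp(hdr)
print(hdr_state == hdr_state2)  # False: the header changed

unknown = object()
print(get_state_stamp(unknown) != get_state_stamp(unknown))  # True
```

Returning a new sentinel on every call matters here: Python containers short-circuit comparison on identity, so reusing a single sentinel instance inside nested stamps could make two "unknown" states compare equal.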