Merging strong scaling branch #91
I presume the changes are mostly in the former. At the moment I'd like to focus on writing my thesis. Perhaps I can take a look at this merge eventually, but please go ahead yourself if you want it in quickly.
Which branch is the correct one? I guess the first one?
Hi Martin,
It would be jlab/nesap_hacklatt_strongscale
(this is actually a merge of optimize-strongscale which undoes the receive queues,
and does in-order message reception — which brings back with it binary reproducibility).
It then splits threads working over forward and backward faces in each direction.
Best,
B
… On Jul 18, 2017, at 5:00 AM, Martin Ueding ***@***.***> wrote:
Which branch is the correct one? I guess the middle one?
• jlab/feature_receive_queues (already merged into devel)
• jlab/nesap_hacklatt_strongscale (Early 2017)
• jlab/optimize-strongscale (Mid 2015)
--------------------------------------------------------------------------------------
Dr Balint Joo High Performance Computational Scientist
Jefferson Lab
12000 Jefferson Ave, Suite 3, MS 12B2, Room F217,
Newport News, VA 23606, USA
Tel: +1-757-269-5339, Fax: +1-757-269-5427
email: [email protected]
-------------------------------------------------------------------------------------
I am now merging this branch. It still does not compile, and even when it does, the tests will probably fail. I'll keep working on it.
Also, these changes only benefit the Wilson and Wilson clover cases. The twisted mass cases are not changed, so they will not see any improvement. Of course, one could now try to make the same changes there as well; it would just be another four functions to change ({TM Wilson, TM clover} × {dslash, achimbdpsi}). The code is extremely similar, so I would say that separating the parallelization logic from the physical operator would make things much easier down the road. However, it would require some refactoring.
For some reason, the Wilson Dslash does not work any more. For today I'm done; perhaps I can finish this in the next couple of days.
Ran the tests again.
The tests now work on my machine™; I made some mistake when manually merging. One needs to test whether these changes still give the same performance that they did in March. I have merged this manually, and since the unit tests pass, it should at least give the correct result. However, I might have introduced some mistake with the OpenMP directives such that performance is not as expected. Could you please verify that the strong scaling is still improved, and then merge?

Could we in the future please try to avoid these long-running feature branches and try to get features merged in quickly?
This simply doesn't work in science, in my experience. Work like this is done, often somewhat carelessly, in short spurts of a few days of activity (at hackathons, for example) and then necessarily lingers for months or even a year (conferences, teaching duties, physics work...). The prudent thing to do is to never make cosmetic changes to the whole code base (such as reformatting all code) unless one is absolutely sure that all feature branches have been merged, as annoying as this may seem. Of course, in our case we also introduced a metric ton of new features into QPhiX...

Excellent work on merging this, it must have been extremely challenging. We still need to check whether it's functionally correct in all circumstances, though, and check performance. In particular, it would be important to make these features optional (if possible), in case they affect performance on few nodes substantially. Most of the time, however, we do work in the strong scaling limit with local lattices of the order of 8^3*16 or so.

@azrael417 I would be interested to help in looking into background progression using tasks once this has landed.
My mindset with software development is usually that we are a software “company” and that we want to have the best product for our end-users. However, our product is the papers that we write with the data that we incidentally generate with our software. So I need to make sure that devel is rather conservative, such that these long-running branches are less of a problem.
I am almost always falling into the boy-scouts pattern (leave the site cleaner than you found it) and not into the spec-ops pattern (make the needed changes and get out); therefore I rewrite comments, fix the formatting using clang-format, or improve something else. I'll try to keep the branches more focused on a single issue.
Mostly irritating, because git got thoroughly confused: the strong-scale changes were done in two copies of the functions (DPsiPlus and DPsiMinus), but on devel there was only one copy of these functions. So the merge copied back in all the code that I had removed, and then there were merge conflicts reaching over function scopes. git mergetool with gvimdiff (or vimdiff) is a help there, because you get four windows (common base, ours, theirs, merge result).

These changes are probably worthwhile to also have in the twisted mass implementations. One could change the two Dslash classes as well, but I wouldn't feel too good about that from a software engineering point of view. The parallelization seems to be independent of the actual Dslash class in use. I have not thought enough about this, but all those team/thread/lattice index calculations should be factored out of the Dslash classes. Perhaps one has a virtual base class for the Dslash and then only specializes minor parts? That would mean virtual function calls, which might not be a performance problem; we already call the kernels through a function pointer and it seems to be innocuous. Or we could introduce the CRTP here in order to avoid the virtual calls.
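To make the CRTP idea concrete, here is a minimal sketch. All names (DslashBase, WilsonDslash, applyKernel) are hypothetical stand-ins, not existing QPhiX classes: the shared parallelization logic would live in the base, while the operator-specific kernel is bound at compile time, so no virtual dispatch is involved.

```cpp
#include <cassert>

// Hypothetical sketch only: names are illustrative, not actual QPhiX classes.
// The base holds the shared team/thread/lattice index logic; the derived
// class supplies the operator-specific kernel, resolved at compile time (CRTP).
template <typename Derived>
class DslashBase {
public:
    int run(int site) {
        // ... shared team/thread/lattice index calculations would go here ...
        return static_cast<Derived *>(this)->applyKernel(site);
    }
};

class WilsonDslash : public DslashBase<WilsonDslash> {
public:
    int applyKernel(int site) { return 2 * site; }  // stand-in for the kernel
};

class TMWilsonDslash : public DslashBase<TMWilsonDslash> {
public:
    int applyKernel(int site) { return 2 * site + 1; }  // twisted-mass stand-in
};
```

Since `run` inlines the derived kernel, there is no vtable lookup, which is the point of preferring CRTP over a virtual base class here.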
Here is my thought on this.
In general I was trying to follow GitFlow, which is to have a very stable master and a slowly changing devel which
works with feature branches.
However, I think one particular source of pain is that devel underwent a major (very positive) upheaval thanks to Martin's and Bartosz's
efforts and has changed a lot structurally (with all the elimination of duplication). I would hope that this level of
disruption, though beneficial, is not going to be the status quo over the long term, and that once the structure settles down these merges will
not be so painful.
That having been said, there is likely to be a lot of experimentation, for example in comms strategies. So one question is
how best to host these, and how invasive they are in the main code body. For example, I could imagine we could have
several implementations of comm.h which we could select with CMake; however, they may have different interface code,
which would interfere with things like DPsiPlus/Minus. This can be sufficiently different code to keep in different branches.
I think that since these experiments are the projects of the individual authors, they should in principle be responsible for
keeping them up to date with the current devel branch as needed, unless another factor comes into play (e.g. the need for expediency,
or if someone really wants to test a different branch).
Ultimately we can follow the ‘trusted lieutenants’ model to merge features onto the devel branch, but at the moment our group
is small, so I think we can all be trusted lieutenants, especially if our branches pass the tests on Travis before merging. So this is all
fine as long as the interfaces (e.g. to Chroma) don’t change, or change only in a minor way, e.g. by adding extra optional params.
If the interfaces change a lot, we should have a discussion on Slack as to how that affects everyone.
Does this seem a reasonable attitude?
Best,
B
PS: I don’t worry about cosmetics very much. Best, B
That is exactly what I would like to avoid. Basically this would abuse git as a feature selection tool. Refactoring becomes impossible, because one would have to refactor both branches at the same time. There would be tons of merge conflicts, and I doubt that one could easily switch between those branches at will. And since one would want to keep both around, those would be eternal feature branches; every new feature would have to be ported to both of them.

Also it does not scale. Say we want to have another aspect in two different ways; then there are four branches if one wants to have every possibility. One project that I worked on had “compile branches” because some compiler flags were hard coded.

Instead I would suggest to have a modular architecture with a clearly defined interface boundary. The communication interface is owned by the Dslash classes, because we don't want to have any downward dependency arrows. The Dslash class must not care about the details of the communication; it sits above that. So we must have the interface defined on the level of the Dslash classes, and then we can provide various communication implementations. This would be a clean object-oriented design pattern. If we don't want to use dynamic polymorphism (virtual functions), we can implement this with CRTP, but that is a detail. See https://martinfowler.com/bliki/BranchByAbstraction.html
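A minimal sketch of what such a branch-by-abstraction boundary could look like. All names here (CommsStrategy, ReceiveQueueComms, InOrderComms, runDslashWith) are hypothetical, not existing QPhiX code: the Dslash only sees the interface, and each communication scheme lives behind it as a separate implementation.

```cpp
#include <cassert>
#include <string>

// Illustrative interface owned by the Dslash layer; class names are
// hypothetical, not existing QPhiX code.
class CommsStrategy {
public:
    virtual ~CommsStrategy() = default;
    virtual void startFaceExchange(int dim) = 0;
    virtual void finishFaceExchange(int dim) = 0;
    virtual std::string name() const = 0;
};

class ReceiveQueueComms : public CommsStrategy {
public:
    void startFaceExchange(int) override { /* post to receive queues */ }
    void finishFaceExchange(int) override { /* drain queues, any order */ }
    std::string name() const override { return "receive-queues"; }
};

class InOrderComms : public CommsStrategy {
public:
    void startFaceExchange(int) override { /* post ordered receives */ }
    void finishFaceExchange(int) override { /* wait in order: reproducible */ }
    std::string name() const override { return "in-order"; }
};

// The operator code only depends on the interface, so strategies can be
// swapped without touching the Dslash itself.
std::string runDslashWith(CommsStrategy &comms) {
    comms.startFaceExchange(0);
    // ... apply the operator body while faces are in flight ...
    comms.finishFaceExchange(0);
    return comms.name();
}
```

With such a boundary, a new comms experiment is a new class rather than a new long-lived branch; the CRTP variant would replace the virtual interface with a template parameter.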
Definitely! Ideally we would have very regular communication in order to prevent working on the same aspect in different branches for weeks/months.
Both parties will be able to keep up with the changes that way. This is exactly what has happened with the strong scaling branch here. I would like to have feature branches running only for a few days. If it is a new feature that is not ready yet, it can still be merged early.
Unfortunately, for me it takes a lot of willpower not to brush up the cosmetics as I go. But I'll try not to tweak too much here and there, and to concentrate on the task at hand.
@martin-ueding @azrael417
I've started experimenting with (optionally!) offloading comms to a single core/thread, or to multiple ones (not yet implemented). On the aforementioned Marconi A2, if the measurements can be trusted, I get about a 20% speed-up at the far end of the strong-scaling window (up to fluctuations, which are unfortunately still a major problem on this machine) and a similar speed-up in some parts of the strong-scaling window on a Broadwell-based machine with the same network. Combined with lower OpenMP overheads from coarse-grained threading, this might substantially improve our strong-scaling potential. It would be important to figure out, however, whether the deadlocks or race conditions I've encountered in the strongscale branch (as mentioned in the comment above) are related.
Hi Bartosz,
There was the nesap-hacklatt-strongscale branch which ran on Cori, but that is over a year old... It tried to split threads to send faces. I recall the X direction was tricky, because for an Xh=4 lattice you could get races with some threads writing one end of a vector while others tried to write the other end. Strong scaling definitely needs more attention, and I would strongly encourage your efforts in this direction.
Thanks and best wishes, B
Dr Balint Joo, Scientific Computing Group, Jefferson Lab, Tel: +1 757 269 5339 (Office)
…Sent from my mobile phone
On Oct 31, 2017, at 6:27 AM, Bartosz Kostrzewa ***@***.***> wrote:
@martin-ueding @azrael417
I've begun testing this branch on Marconi A2 (the KNL+OmniPath machine
in Italy) and my jobs lock up unfortunately and get killed because they
exceed the allotted wallclock time. Was the strongscale branch tested
thoroughly for deadlocks?
Hi Balint, yes, Martin merged the strong-scaling changes. I'll take a closer look and will try to figure out what went wrong. I agree that strong scaling is a top priority, and given the results presented by the Tsukuba/Tokyo groups on the Oakforest-PACS machine, it seems that it should be possible to get much better performance with the right strategy. QPhiX node-level performance is really quite good with the right setup: I'm able to get well in excess of 300 GFlop/s (ppn=1, sz=2, omp_num_threads=136, balanced kmp_affinity, 68-core KNL) even for small lattices (below 16^4), and it would likely be even better with reduced OpenMP overheads. It should thus be possible to sustain 200 GFlop/s per node even on large machine partitions, as long as overlapping comms can be made to work.
Perhaps it would be worthwhile to compile the original version from March (?). It is possible that I introduced that deadlock while merging. The older version likely uses GNU Autotools and might not have all the needed AVX512 kernels available.
My merged version (see above).
I have thought about having multiple communication strategies implemented in parallel. It seems to make more sense to generalize the Dslash operators (perhaps DiracParts is a better name for them, given the A and A_inv terms?). Then one can define communication strategies that use those functions. This is what I have in mind right now.
The current devel branch still uses receive queues. We found that they do not perform optimally in the strong scaling limit. We should merge the strong scaling branch modifications into devel. The issue is that this merge cannot be done automatically, because the interface changed a lot after the restructuring. Also, the strong scale branch has a coarse-grained OpenMP model, which makes the whole thing more tricky.
Once that is done, I could play around with background progression of comms using OMP tasks or similar tricks.
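A hedged sketch of what background progression with OpenMP tasks could look like. Everything here is illustrative: progressComms() is a placeholder for polling outstanding requests (a real version would call something like MPI_Test), and the function names are invented for the example.

```cpp
#include <atomic>
#include <cassert>

// Illustrative only: one task repeatedly polls a stand-in progress function
// while another task overlaps the interior (bulk) compute work.
std::atomic<int> progressCalls{0};
std::atomic<bool> commsDone{false};

bool progressComms() {
    // Placeholder for e.g. MPI_Test on outstanding face-exchange requests.
    return ++progressCalls >= 4;  // pretend comms finish after a few polls
}

int computeInterior() {
    int work = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task  // background progression of comms
        {
            while (!progressComms()) { /* keep the network moving */ }
            commsDone = true;
        }
        #pragma omp task shared(work)  // interior work overlaps the comms
        {
            for (int i = 0; i < 100; ++i)
                work += 1;
        }
        #pragma omp taskwait  // faces can be applied once both tasks are done
    }
    return work;
}
```

If compiled without OpenMP the pragmas are ignored and the two blocks simply run in sequence, so the sketch stays correct either way; the interesting case is of course with tasking enabled, where the poll loop keeps the network progressing while the bulk work runs.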