Conversation
|
You need to add yourself to the contributors file (alphabetical order) You say that you tested it to test that it works, how did you test it? As I recall you were using a input mesh file that had inter-grid maps. So did it not fail in the partitioning? Looks like you just changed the number of mpis used on existing tasks. The KGOs have changed, though I be suspicious of some of the checksums. There are a lot of differences, some only the last few digits, though some much larger. I think you need to provide some plots comparing 1 cpu vs multiple cpus to show that the fields are not significantly impacted. |
Added myself to the contributor list (added myself in a previous PR, but it has not been merged yet). I also fixed a detail in the new error message added to the code. I modified the existing lfric2lfric rose stem tests, so that three of the jobs run with multiple CPUS: ral_seuk-C32_lam_MG, clim_gal9-C24_C12, and oasis_clim_gal9-C24_C12. That is the reason why there are KGO changes. I am afraid the results change slightly with the partitioning. |
| help=In the lfric2lfric program, this variable indicates the | ||
| =mesh for which the described partitioning is taking place. | ||
| = | ||
| =The only valid values of this variable in lfric2lfric | ||
| =are 'source' to indicate the source grid, and 'destination' | ||
| =to indicate the destination grid. |
There was a problem hiding this comment.
| help=In the lfric2lfric program, this variable indicates the | |
| =mesh for which the described partitioning is taking place. | |
| = | |
| =The only valid values of this variable in lfric2lfric | |
| =are 'source' to indicate the source grid, and 'destination' | |
| =to indicate the destination grid. | |
| help=The designation of the mesh to which this | |
| =partitioning profile is to be applied. | |
| = | |
| =Recognised designations are: | |
| = * 'source' | |
| = * 'destination' |
| 'Partitioning parameters for the source mesh were not found, '// & | ||
| 'please specify a partitioning namelist with mesh_target=source.' |
There was a problem hiding this comment.
| 'Partitioning parameters for the source mesh were not found, '// & | |
| 'please specify a partitioning namelist with mesh_target=source.' | |
| 'Source mesh partitioning namelist (partitioning:source) not found.' |
If the code cannot find the required partitioning profile namelist, then that is all it should state, the code doesn't know why it can't be found, so it cannot suggest a solution. For example, if this error message was given for Marks problem, he would have checked the namelist file which was not at fault. This message also would have to be updated if a different key instance member was used.
The Rose metadata is enough for the user to be able to check their configuration settings.
|
|
||
| if (.not. configuration%namelist_exists('partitioning', 'destination')) then | ||
| write( log_scratch_space, '(A)' ) & | ||
| 'Partitioning parameters for the source mesh were not found, '// & |
There was a problem hiding this comment.
| 'Partitioning parameters for the source mesh were not found, '// & | |
| 'Destination mesh partitioning namelist (partitioning:destination) not found.' |
As before, the code doesn't know why the namelist is not found. Also the error message still referred to the source mesh.
I think you should edit the contributors list anyway, as it will block this change going on in the event that this PR is ready to commit first. Please refer to my comments in the Scitech review. Please provide some evidence that there is not significant changes in the output (plots) |
|
I wouldn't be concerned about speed at this juncture, My comments where based of the KGO differences and attribution of them. You haven't shown what the 1 CPU Oasis regridded method produces or is that not possible? If the differences are in the long/lats please provide example magnitudes? or the differences? |
We looked in the past that the map and the oasis method gave the same results with one CPU, and this is still true although I did not include the plot. We find the same here with 4 CPUs. I obtain lon/lat differences if I use ncdiff to generate the differences, but when I look at the actual longitude and latitude values in the lfric2lfric output using 1 or multiple CPUs, they are actually the same. I do not know why ncdiff behaves like this. |
ncdiff appears to checks differences at a lower tolerance than just dumping the output. Again, please post the differences in the lat lon values (a representative sample) |
There are no differences in lat lon values when comparing the values directly. There are some differences when comparing using ncdiff. I am confident that the lat lon values is not the problem. I suspect the problem is how lfric reads different types of data, I am preparing a test to check this theory. |
|
As requested, please post the differences in the values you have from ncdiff. |
Taken directly from ncdiff: Mesh2d_face_x = -0.7626276, -0.6977054, -0.6327863, -0.5678701, -0.502957, Mesh2d_face_y = 51.98213, 51.98115, 51.98013, 51.97908, 51.97799, 51.97687, |
|
No, that's not it, have you used nccmp? I want to see the differences in the lat lon values in the output. and similarly in the global mesh files you are using as input. |
So far I only used ncdiff. Differences with nccmp are, for the regional model: Variable Group Count Sum AbsSum Min Max Range Mean StdDev and for the global model: Variable Group Count Sum AbsSum Min Max Range Mean StdDev |
Sorry Juan, |
|
The output for 1 and multiple CPUs are on VDI: global to global 1 CPU 6 CPUs regional to regional 1 CPU 4 CPUs |
|
Glad that you appear to have found a fix for the issue. Although there are 2 issues here:
These are 2 different issues. This ticket is for the 1st issue. The fix to the output is technically a separate issue. If you want to progress this PR you may do so, but not change the rose-suite. If you want to add a multi-cpu test to the test-suite on this PR, then it can't go on without a fix as well. I'd suggest keeping the issues separate.
It keeps the scope of the PRs clear. |
Thanks for your comments. I would prefer having all changes in the same PR though for mainly two reasons:
Furthermore, the fix does not imply any changes to xml files, but to kernels. I will add all changes (they are not so many) and send test, and when you have a look at the final product you still consider we need two PRs, then I can easily split it. |
|
Following the reviewer's suggestions, this PR will now only consist in adding the capability of running in multiple CPUs. The trac.log has been updated. Differences of the test branch with the original branch are due only to the removal of old KGO files (as some of the KGO tests changed name). There are some tests that do not pass, but I guess they are present in the trunk as the changes in this PR only refer to lfric2lfric. |
@ukmo-juan-castillo You have re-requested a review with the view to only applying the changes that allow lfric2lfric to run with multiple cpus. However, the branch you have asked to be re-viewed on the PR has not changed since I last looked at it a month ago, have you pushed your changes to the branch associated with the PR? |
|
I updated as suggested by the reviewer, run again all rose stem tests, and updated trac.log in the PR description. |
|
I've run this As a note: |
mo-rickywong
left a comment
There was a problem hiding this comment.
SciTech review now fine.






PR Summary
Sci/Tech Reviewer: @mo-rickywong
Code Reviewer: @TeranIvy
Change the type of the partitioning key instance member for lfric2lfric, so that lfric2lfric can run with more than one CPU
Tagging Iva and Steve so they are aware of the suspect output from lfric2lfric on multiple CPUs
@TeranIvy , @stevemullerworth
Code Quality Checklist
Testing
No other tests where performed, just modified existing tests so that the new functionality could be tested. Therefore, there were KGO changes that were addressed.
trac.log
Test Suite Results - lfric_apps - test_lfric2lfric_multi_CPU/run1
Suite Information
Task Information
✅ succeeded tasks - 1515
Security Considerations
Performance Impact
AI Assistance and Attribution
Documentation
PSyclone Approval
Sci/Tech Review
Please detail how you tested the functionality? i.e. what modifications did you make to test it? doesn't seem like the rose-stem test should pass given the partitioning checks with the presence of inter-grid maps in the input mesh file.
KGOs changes are numerous and not all insignificant, please provide plots/evidence that output from multiple cpus is not significantly different from runs with multiple cpus (plots?).
SciTech Review: Fail Suspect results from runs with multiple CPUs. While this change may enable lfric2lfric to run with multiple CPUs, the output maybe known but not necessarily good
Code Review