UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

GeorgeVandenberghe-NOAA · 2024-12-18T20:32:09Z

When ufs-weather-model (tested is hr.v4 ) is run at C1152 resolution on Gaea C5 with ESMF managed threading, it hangs or fails when run 128 MPI ranks per node. ESMF managed threading requires 128 ranks per node for full use of the node because it disables traditional threading so we cannot run C1152 with ESMF managed threading. It is possible to get full use of the node by running with traditional threading and plural threads per task (two threads, 64 ranks per node or four threads 32 ranks per node) but other components which do not thread well then use their nodes inefficiently. It is hypothesizes the 2GB/core memory limit is insufficent to run this configuration fully packed, 128 ranks per node but then this begs the question, WHAT is using so much memory even at very high rank counts.? It has failed with 256 ranks per I/O task and two ESMF threads, and 512 ranks per I/O task and two ESMF threads.

GeorgeVandenberghe-NOAA · 2024-12-18T20:32:55Z

the issue can be mitigated by running with traditional threads and 64 or fewer ranks per node.

theurich · 2024-12-19T18:18:54Z

@GeorgeVandenberghe-NOAA Have you been able to attempt ESMF-managed threading runs at full core capacity with the custom Verbosity setting as per:

# EARTH #
EARTH_component_list: MED ATM OCN ICE WAV
EARTH_attributes::
  Verbosity = 32563
::

This should dump a lot of memory tracing information into the ESMF PET* log files. It might give us a clue as to where/why memory pressure is growing to the point of failure. If you have PET* log files with that extra info, I would like to look at them. Thanks!

GeorgeVandenberghe-NOAA · 2024-12-19T18:46:48Z

Thanks. I will try that this afternoon.

…

On Thu, Dec 19, 2024 at 6:19 PM Gerhard Theurich ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> Have you been able to attempt ESMF-managed threading runs at full core capacity with the custom Verbosity setting as per: # EARTH # EARTH_component_list: MED ATM OCN ICE WAV EARTH_attributes:: Verbosity = 32563 :: This should dump a lot of memory tracing information into the ESMF PET* log files. It might give us a clue as to where/why memory pressure is growing to the point of failure. If you have PET* log files with that extra info, I would like to look at them. Thanks! — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FSNALOR6IPYV46LLHD2GME2JAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJVGQ4TKMRQGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-19T19:16:46Z

A run CWD is on /gpfs/f5/scratch/gwv/hr4j/da. Output is oo and error is ee. Where are these memory statistics written?

…

On Thu, Dec 19, 2024 at 6:19 PM Gerhard Theurich ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> Have you been able to attempt ESMF-managed threading runs at full core capacity with the custom Verbosity setting as per: # EARTH # EARTH_component_list: MED ATM OCN ICE WAV EARTH_attributes:: Verbosity = 32563 :: This should dump a lot of memory tracing information into the ESMF PET* log files. It might give us a clue as to where/why memory pressure is growing to the point of failure. If you have PET* log files with that extra info, I would like to look at them. Thanks! — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FSNALOR6IPYV46LLHD2GME2JAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJVGQ4TKMRQGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-19T19:30:34Z

What do I toggle to get those PET logs turned on? On Thu, Dec 19, 2024 at 7:16 PM George Vandenberghe - NOAA Affiliate < ***@***.***> wrote:

…

A run CWD is on /gpfs/f5/scratch/gwv/hr4j/da. Output is oo and error is ee. Where are these memory statistics written? On Thu, Dec 19, 2024 at 6:19 PM Gerhard Theurich ***@***.***> wrote: > @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> > Have you been able to attempt ESMF-managed threading runs at full core > capacity with the custom Verbosity setting as per: > > # EARTH # > EARTH_component_list: MED ATM OCN ICE WAV > EARTH_attributes:: > Verbosity = 32563 > :: > > This should dump a lot of memory tracing information into the ESMF PET* > log files. It might give us a clue as to where/why memory pressure is > growing to the point of failure. If you have PET* log files with that extra > info, I would like to look at them. Thanks! > > — > Reply to this email directly, view it on GitHub > <#2540 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ANDS4FSNALOR6IPYV46LLHD2GME2JAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJVGQ4TKMRQGI> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > -- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-19T19:33:32Z

What do I toggle to get those PET logs turned on?

In ufs.configure:

# ESMF #
logKindFlag:            ESMF_LOGKIND_MULTI
globalResourceControl:  true

GeorgeVandenberghe-NOAA · 2024-12-19T20:44:25Z

Done. A run with the PET logs is on /gpfs/f5/scratch/gwv/hr4j/da

…

On Thu, Dec 19, 2024 at 7:33 PM Gerhard Theurich ***@***.***> wrote: What do I toggle to get those PET logs turned on? # ESMF # logKindFlag: ESMF_LOGKIND_MULTI globalResourceControl: true — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FTSAFAPOVX6T7NJLST2GMNSFAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJVGYZDMOJZGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-20T02:15:48Z

@GeorgeVandenberghe-NOAA I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is setup to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading?

GeorgeVandenberghe-NOAA · 2024-12-20T02:21:57Z

Yes but it is run 64 or 32 PETS per node with traditional threading. I suspect it will fail the same way with one thread and a packed node 128 ranks per node.

…

On Thu, Dec 19, 2024 at 9:16 PM Gerhard Theurich ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is setup to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading? — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FUIVNRBUSDKJT7RSOD2GN4WXAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJWGEZDMNZZHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-20T02:22:33Z

I will report the memory pressure to the WAVE people.

…

On Thu, Dec 19, 2024 at 9:16 PM Gerhard Theurich ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is setup to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading? — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FUIVNRBUSDKJT7RSOD2GN4WXAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJWGEZDMNZZHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-20T02:31:00Z

You could try running WAV with different threading levels under ESMF-managed threading. E.g.

# WAV #
WAV_model:                      ww3
WAV_petlist_bounds:             8296 9293
WAV_omp_num_threads:            2
WAV_attributes::
  Verbosity = 0
  OverwriteSlice = false
  mesh_wav = mesh.uglo_m1g16.nc
  user_sets_restname = false
::

To run 2x threaded, therefore using 64 tasks per node, or with

WAV_omp_num_threads:            4

for 4x way threaded, using 32 tasks per node. Still using 998 cores in total for any of those cases, just changing the threading level. Would be curious to see how that changes things.

GeorgeVandenberghe-NOAA · 2024-12-20T12:06:29Z

Thanks, I'll check it out.

…

On Thu, Dec 19, 2024 at 9:31 PM Gerhard Theurich ***@***.***> wrote: You could try running WAV with different threading levels under ESMF-managed threading. E.g. # WAV # WAV_model: ww3 WAV_petlist_bounds: 8296 9293 WAV_omp_num_threads: 2 WAV_attributes:: Verbosity = 0 OverwriteSlice = false mesh_wav = mesh.uglo_m1g16.nc user_sets_restname = false :: To run 2x threaded, therefore using 64 cores per node, or with WAV_omp_num_threads: 4 for 4x way threaded, using 32 cores per node. Still using 998 cores in any of those cases, just changing the threading level. Would be curious to see how that changes things. — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FTO3GUWJWDC7CE6F5T2GN6PVAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJWGE2DCMBXGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

JessicaMeixner-NOAA · 2024-12-20T13:45:54Z

@GeorgeVandenberghe-NOAA I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is setup to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading?

Thanks @GeorgeVandenberghe-NOAA and @theurich - I just wanted to acknowledge here that the wave people have seen this. @DeniseWorthen has also observed the wave memory issues and has done some work to address some of the issues, which can be seen in a draft PR here: NOAA-EMC/WW3#1317

GeorgeVandenberghe-NOAA · 2024-12-20T13:52:38Z

The other choke point is inline post. I can run this with traditional threads with 256 ranks per I/O group but it fails with ESMF managed threads on a packed node with 256 and 512 MPI ranks per I/O group on gaea C5. This will be a serious issue increasing resolution on WCOSS2 also and I am puzzled where the memory for THAT is being used. 512 tasks are spread between four nodes each with 256GB of memory!

…

On Fri, Dec 20, 2024 at 8:46 AM Jessica Meixner ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is setup to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading? Thanks @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> and @theurich <https://github.com/theurich> - I just wanted to acknowledge here that the wave people have seen this. @DeniseWorthen <https://github.com/DeniseWorthen> has also observed the wave memory issues and has done some work to address some of the issues, which can be seen in a draft PR here: NOAA-EMC/WW3#1317 <NOAA-EMC/WW3#1317> — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FR55EDIB7G6VCHSEMD2GQNSTAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGA2DQNJTG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

DeniseWorthen · 2024-12-20T14:09:51Z

It doesn't look like the case @GeorgeVandenberghe-NOAA is pointing to (/gpfs/f5/scratch/gwv/hr4j/da) has the PIO in WW3 enabled. Is that intentional?

JessicaMeixner-NOAA · 2024-12-20T14:12:39Z

I can try to get @GeorgeVandenberghe-NOAA a test case with PIO enabled by the end of the day - with @sbanihash help we almost have a PR ready for g-w to generate a new test case

GeorgeVandenberghe-NOAA · 2024-12-20T14:14:33Z

How is this enabled? For this situation inline post failure trumps the wave memory issues anyway and wave memory issues can be addressed with additional ESMF threads. UFS post memory issues have persisted with two threads and 256 or 512 ranks per I/O group, 128 ranks per node.

…

On Fri, Dec 20, 2024 at 9:10 AM Denise Worthen ***@***.***> wrote: It doesn't look like the case @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> is pointing to (/gpfs/f5/scratch/gwv/hr4j/da) has the PIO in WW3 enabled. Is that intentional? — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FUSLVQE6RZD3U6YFSL2GQQMLAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGA4DSNJZGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-20T14:15:35Z

Is there a namelist thing I can toggle?

…

On Fri, Dec 20, 2024 at 9:13 AM Jessica Meixner ***@***.***> wrote: I can try to get @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> a test case with PIO enabled by the end of the day - with @sbanihash <https://github.com/sbanihash> help we almost have a PR ready for g-w to generate a new test case — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FRCXORKJNDEHXJAXGD2GQQW5AVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGA4TGOBWGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

DeniseWorthen · 2024-12-20T14:29:14Z

You can toggle inline post off in model_configure (write_dopost: .false.).

To toggle PIO for WW3 the model needs to have been compiled w/ PIO in the switch for WW3. I don't know if you case has that or not.

theurich · 2024-12-20T16:07:08Z

@DeniseWorthen and @JessicaMeixner-NOAA It's great to know wave people are aware of the memory pressure coming from WW3, and even greater there is already a PR to address it! Do you think George should be testing here with WW3 changes from that PR?

@GeorgeVandenberghe-NOAA is the next step to attempts a run on same layout (as far as tasks and threading is concerned for each component), but with inline post off, and PIO active for WW3? We would expect a successful run. After that turn inline post back on, and observe what happens?

DeniseWorthen · 2024-12-20T16:15:32Z

@theurich There are two things that have been/can be done w/rt WW3 memory pressure. The first was implementing PIO for WW3 restarts. That has been committed, it requires compiling WW3 w/ the PIO ifdef and some additional settings in ufs.configure. It may not yet be in G-W though.

The second is a draft PR to eliminate duplicate fields. That has sat in draft because I ran into a test case---Hera+GNU+Release which did not reproduce baselines. All other cases did. I also ran cases on Hercules and Gaea and everything passed.

Hera uses a more recent GNU version though. Since the GNU+Debug passed, my supposition is that there is an optimization which is changing answers, but I have not had time to debug.

GeorgeVandenberghe-NOAA · 2024-12-20T16:21:23Z

To be very frank it's the inline post memory pressure that concerns me more because I can't fix this toggling ESMF thread count as I can with Wave and because inline post memory pressure is going to scale with ATM resolution. Inline post memory pressure also looks like a soon to be problem for traditionally threaded runs at higher than C1152 resolution too.

…

On Fri, Dec 20, 2024 at 4:15 PM Denise Worthen ***@***.***> wrote: @theurich <https://github.com/theurich> There are two things that have been/can be done w/rt WW3 memory pressure. The first was implementing PIO for WW3 restarts. That has been committed, it requires compiling WW3 w/ the PIO ifdef and some additional settings in ufs.configure. It may not yet be in G-W though. The second is a draft PR to eliminate duplicate fields. That has sat in draft because I ran into a test case---Hera+GNU+Release which did not reproduce baselines. All other cases did. I also ran cases on Hercules and Gaea and everything passed. Hera uses a more recent GNU version though. Since the GNU+Debug passed, my supposition is that there is an optimization which is changing answers, but I have not had time to debug. — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FTY6MHP6XR5WVF6YUD2GQ7DVAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGMYDCNJYGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-20T16:28:32Z

It sounds right to focus on the inline post memory pressure issue. Do you have memory logging from a run where WAV isn't running out of memory, but where inline post is causing the issue, that I can look at? Thanks.

GeorgeVandenberghe-NOAA · 2024-12-20T17:12:01Z

I'll make one.

…

On Fri, Dec 20, 2024 at 4:28 PM Gerhard Theurich ***@***.***> wrote: It sounds right to focus on the inline post memory pressure issue. Do you have memory logging from a run where WAV isn't running out of memory, but where inline post is causing the issue, that I can look at? Thanks. — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FV2SSEOVARATYK4NDL2GRAUNAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGMZDCOBRGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-20T17:56:48Z

I've run with inline post and with WAVE configured with two threads to survive. It's in the same location /gpfs/f5/scratch/gwv/hr4j/da

…

On Fri, Dec 20, 2024 at 4:28 PM Gerhard Theurich ***@***.***> wrote: It sounds right to focus on the inline post memory pressure issue. Do you have memory logging from a run where WAV isn't running out of memory, but where inline post is causing the issue, that I can look at? Thanks. — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FV2SSEOVARATYK4NDL2GRAUNAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXGMZDCOBRGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-20T18:41:17Z

@GeorgeVandenberghe-NOAA It looks like the PET* log files under /gpfs/f5/scratch/gwv/hr4j/da contains output from several different runs. That makes it very hard to post process and analyze. Could you post PET* log files somewhere from just one single run that fails due to inline post?

GeorgeVandenberghe-NOAA · 2024-12-20T19:31:13Z

Okay. Had to step out of the office for a kid problem. Will resubmit to make new ones

On Friday, December 20, 2024, Gerhard Theurich ***@***.***> wrote: @GeorgeVandenberghe-NOAA It looks like the PET* log files under

/gpfs/f5/scratch/gwv/hr4j/da contains output from several different runs. That makes it very hard to post process and analyze. Could you post PET* log files somewhere from just one single run that fails due to inline post?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.<

https://ci3.googleusercontent.com/meips/ADKq_NY--ebYQtTj0Ll2Vwt_-D4LsCnGOLOj6Urkb9c2QUR9PAdTNa_x39uflVaduNqLXhtc5v8Q9AM4wknv-b64T20-KkWhOF-QZtSTH4m0QfbMqm2PUpGth7AOsXWJRHQ2mC3UoO3JXqCNnpjqdsNS25wmwmQS1LXGOll4xKPgpqXioxodn2IYtTrjWDs4fb4bPfpRKH9ycz59HVnJTc6qwAehRgIBKm19uxiaOcfrRgq0inq6JkOT7jY=s0-d-e1-ft#https://github.com/notifications/beacon/ANDS4FVG5MZA6NBGJESBWQD2GRQGHA5CNFSM6AAAAABT3QB4ASWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTUYODPTI.gif>Message ID: ***@***.***>

…

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA · 2024-12-20T20:36:07Z

I have resubmitted a clean run with new PET files created and it is done and they are created. Same location On Fri, Dec 20, 2024 at 2:31 PM George Vandenberghe - NOAA Affiliate < ***@***.***> wrote:

…

Okay. Had to step out of the office for a kid problem. Will resubmit to make new ones On Friday, December 20, 2024, Gerhard Theurich ***@***.***> wrote: > @GeorgeVandenberghe-NOAA It looks like the PET* log files under /gpfs/f5/scratch/gwv/hr4j/da contains output from several different runs. That makes it very hard to post process and analyze. Could you post PET* log files somewhere from just one single run that fails due to inline post? > > — > Reply to this email directly, view it on GitHub, or unsubscribe. > You are receiving this because you were mentioned.< https://ci3.googleusercontent.com/meips/ADKq_NY--ebYQtTj0Ll2Vwt_-D4LsCnGOLOj6Urkb9c2QUR9PAdTNa_x39uflVaduNqLXhtc5v8Q9AM4wknv-b64T20-KkWhOF-QZtSTH4m0QfbMqm2PUpGth7AOsXWJRHQ2mC3UoO3JXqCNnpjqdsNS25wmwmQS1LXGOll4xKPgpqXioxodn2IYtTrjWDs4fb4bPfpRKH9ycz59HVnJTc6qwAehRgIBKm19uxiaOcfrRgq0inq6JkOT7jY=s0-d-e1-ft#https://github.com/notifications/beacon/ANDS4FVG5MZA6NBGJESBWQD2GRQGHA5CNFSM6AAAAABT3QB4ASWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTUYODPTI.gif>Message ID: ***@***.***> -- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

theurich · 2024-12-20T21:18:57Z

What happens if you increase the threading level for ATM, e.g. to 4x?

# ATM #
ATM_model:                      fv3
ATM_petlist_bounds:             0 7935
ATM_omp_num_threads:            4
ATM_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  OverwriteSlice = true
::

This change requires you also change model_configure to keep giving the same number of packed cores to the WRT components:

quilting:                .true.
quilting_restart:        .true.
write_groups:            2
write_tasks_per_group:   128

With this, the FCST component still gets the first 6912 cores (now using them with 1728 tasks 4x threaded). The two WRT comps get each 128x4 = 512 cores as before, now using those cores with 128 tasks 4x threaded.

Does that reduce the memory pressure?

theurich · 2024-12-20T21:27:11Z

In future runs, could you set Verbosity=high for all of the components (that currently set it Verbosity = 0). Just more context for when looking at the PET* logs. Thanks!

GeorgeVandenberghe-NOAA · 2024-12-20T22:04:52Z

I will try that

…

On Fri, Dec 20, 2024 at 4:19 PM Gerhard Theurich ***@***.***> wrote: What happens if you increase the threading level for ATM, e.g. to 4x? # ATM # ATM_model: fv3 ATM_petlist_bounds: 0 7935 ATM_omp_num_threads: 4 ATM_attributes:: Verbosity = 0 DumpFields = false ProfileMemory = false OverwriteSlice = true :: This change requires you also change model_configure to keep giving the same number of packed cores to the WRT components: quilting: .true. quilting_restart: .true. write_groups: 2 write_tasks_per_group: 128 With this, the FCST component still gets the first 6912 cores (now using them with 1728 tasks 4x threaded). The two WRT comps get each 128x4 = 512 cores as before, now using those cores with 128 tasks 4x threaded. Does that reduce the memory pressure? — Reply to this email directly, view it on GitHub <#2540 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FXZB5HCEMVY7DBJLRL2GSCVRAVCNFSM6AAAAABT3QB4ASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJXG4ZDSOJXGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA added the bug Something isn't working label Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

GeorgeVandenberghe-NOAA commented Dec 18, 2024

GeorgeVandenberghe-NOAA commented Dec 18, 2024

theurich commented Dec 19, 2024

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

theurich commented Dec 19, 2024 •

edited

Loading

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024 •

edited

Loading

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

JessicaMeixner-NOAA commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

DeniseWorthen commented Dec 20, 2024

JessicaMeixner-NOAA commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

DeniseWorthen commented Dec 20, 2024

theurich commented Dec 20, 2024

DeniseWorthen commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

Comments

GeorgeVandenberghe-NOAA commented Dec 18, 2024

GeorgeVandenberghe-NOAA commented Dec 18, 2024

theurich commented Dec 19, 2024

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

theurich commented Dec 19, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Dec 19, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

JessicaMeixner-NOAA commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

DeniseWorthen commented Dec 20, 2024

JessicaMeixner-NOAA commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

DeniseWorthen commented Dec 20, 2024

theurich commented Dec 20, 2024

DeniseWorthen commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 20, 2024

theurich commented Dec 20, 2024

GeorgeVandenberghe-NOAA commented Dec 20, 2024 via email

theurich commented Dec 19, 2024 •

edited

Loading

theurich commented Dec 20, 2024 •

edited

Loading