client, server, web: enable BUDA GPU apps #5960

davidpanderson · 2024-12-15T08:13:50Z

With BUDA, the (BOINC) app version has just the docker wrapper.
Everything else (Dockerfile, executables) is part of the workunit,
along with the input files.

We want to support GPU applications in BUDA.
That implies that:

- The plan class (saying what GPU type and driver are needed)
  is part of the workunit rather than the app version.
- The resource usage (GPU, fraction of CPU) is part of
  the workunit that's sent to and stored on the client.

This required changes to both scheduler and client
(and to a small extent web).
Fortunately I was able to keep the changes fairly simple.

DB:
When we create a BUDA workunit, we store its plan class
as an element of its xml_doc, where the scheduler can see it.
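
As a rough illustration of what "the scheduler can see it" means, the sketch below pulls a <plan_class> value out of an xml_doc string. The helper name extract_plan_class is hypothetical and the plain string search is an assumption made for brevity; it is not BOINC's XML parser.

```cpp
#include <string>

// Hypothetical helper (not part of BOINC): copy the text between
// <plan_class> and </plan_class> into plan_class.
// Returns false, leaving plan_class untouched, if the element is absent.
bool extract_plan_class(const std::string& xml_doc, std::string& plan_class) {
    const std::string open_tag = "<plan_class>";
    const std::string close_tag = "</plan_class>";

    const std::string::size_type start = xml_doc.find(open_tag);
    if (start == std::string::npos) return false;

    const std::string::size_type value_begin = start + open_tag.size();
    const std::string::size_type end = xml_doc.find(close_tag, value_begin);
    if (end == std::string::npos) return false;

    plan_class = xml_doc.substr(value_begin, end - value_begin);
    return true;
}
```

A workunit without the element simply yields false, which matches the non-BUDA case where the plan class comes from the app version.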

Scheduler:
If a workunit has a plan class,
call the plan class function to see if we can send it to the host
and if so to get the usage info.
Include the usage info in the <workunit> element in the scheduler reply.
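
The decision described here can be sketched roughly as follows. All types, fields and the app_plan_sketch function are simplified stand-ins invented for illustration (the real HOST_USAGE, BEST_APP_VERSION and app_plan() differ); the only point is that a workunit carrying a plan class gets its usage from the plan class check, while any other job uses the usage cached with the app version.

```cpp
#include <string>

// Simplified stand-ins; field names are assumptions, not BOINC's.
struct HostUsage {
    double avg_ncpus = 1.0;
    double gpu_usage = 0.0;
};
struct Host {
    bool has_nvidia_gpu = false;
};
struct Workunit {
    std::string plan_class;  // empty if xml_doc had no <plan_class>
};
struct BestAppVersion {
    HostUsage host_usage;    // cached per app
};

// Stand-in for the plan class function: can this host run the class,
// and if so, what resources does a job use?
bool app_plan_sketch(const Host& host, const std::string& plan_class,
                     HostUsage& usage) {
    if (plan_class == "cuda") {
        if (!host.has_nvidia_gpu) return false;
        usage.avg_ncpus = 0.1;
        usage.gpu_usage = 1.0;
        return true;
    }
    return false;  // unknown plan class: don't send
}

// Pick the usage for one job. Returns false if the job can't be sent.
bool usage_for_job(const Host& host, const Workunit& wu,
                   const BestAppVersion& bav, HostUsage& usage) {
    if (wu.plan_class.empty()) {
        usage = bav.host_usage;  // non-BUDA: usage comes from the app version
        return true;
    }
    return app_plan_sketch(host, wu.plan_class, usage);  // BUDA: per workunit
}
```

Because the usage is computed per job rather than per app, two BUDA jobs that share the same app version can still end up with different usage.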

Feeder:
It makes a list of GPU types the project can use;
this is used in scheduler replies.
This list now must reflect not only APP_VERSION plan classes,
but also BUDA app variants.
We do this using a file 'buda_plan_classes'
that's maintained by the web code.
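
A minimal sketch of reading such a file is shown below. It assumes the file holds one plan class name per line, which this PR does not state, and the function name read_plan_class_file is hypothetical. It uses plain stdio; as the build log further down shows, the FCGI build of the scheduler needed different handling.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical reader: append each non-empty line of the file to names.
// Assumes one plan class name per line (an assumption, not confirmed here).
bool read_plan_class_file(const char* path, std::vector<std::string>& names) {
    FILE* f = std::fopen(path, "r");
    if (!f) return false;  // a missing file just means no BUDA variants

    char buf[256];
    while (std::fgets(buf, sizeof(buf), f)) {
        buf[std::strcspn(buf, "\r\n")] = '\0';  // strip the line ending
        if (buf[0] != '\0') names.push_back(buf);
    }
    std::fclose(f);
    return true;
}
```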

Client:
A new struct RESOURCE_USAGE has GPU/CPU usage info.
APP_VERSION and WORKUNIT (for BUDA jobs) both have one.
The appropriate one is copied to RESULT when it's created.
Scheduling and work fetch code references this copy.
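
The copy-on-create idea can be sketched like this. The struct layouts and the has_resource_usage flag are illustrative assumptions rather than the actual client definitions; only the pattern (one usage struct, copied from whichever source applies) is taken from the description above.

```cpp
// Illustrative layouts only; the real client structs have many more fields.
struct ResourceUsage {
    double avg_ncpus = 1.0;    // fraction or number of CPUs used
    double gpu_usage = 0.0;    // GPU instances used (0 = CPU-only job)
    bool missing_gpu = false;  // required GPU or library not present
};

struct AppVersion {
    ResourceUsage resource_usage;
};

struct Workunit {
    bool has_resource_usage = false;  // assumed flag: true for BUDA jobs
    ResourceUsage resource_usage;
};

struct Result {
    ResourceUsage resource_usage;  // the copy that scheduling code reads
};

// When a result is created, copy the usage from the workunit for BUDA
// jobs, otherwise from the app version.
Result make_result(const AppVersion& avp, const Workunit& wup) {
    Result r;
    r.resource_usage =
        wup.has_resource_usage ? wup.resource_usage : avp.resource_usage;
    return r;
}
```

After creation, code that needs usage info reads only the result's copy and never has to ask whether the job is BUDA.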

Scheduler protocol:
It can now include plan class and resource usage info.

If you make a variant of a BUDA app for a plan class (e.g. NVIDIA GPU with CUDA), this ensures that jobs submitted to that variant are sent only to capable hosts, and that the host usage and projected FLOPS are set correctly.

On the web side, we add a <plan_class> element to workunit.xml_doc. This gets sent to the scheduler.

On the scheduler this required some reorganization. As the scheduler scans jobs, it finds and caches a BEST_APP_VERSION for each app. This contains a HOST_USAGE. In the case of BUDA, the host usage depends on the workunit, not the app version. We might scan several BUDA jobs; they'll all use the same APP_VERSION, but they could have different plan classes and therefore different HOST_USAGE.

So if we're looking at a job to send, and the WU has a <plan_class> element, call app_plan() to check the host capability and get the host usage.

Change add_result_to_reply() so that it takes a HOST_USAGE& argument, rather than getting it from the BEST_APP_VERSION.

We do this in several places:
- sched_array (old scheduling policy)
- sched_score (new scheduling policy)
- sched_locality (locality scheduling)
- sched_resend (resending lost jobs)
- sched_assign (assigned jobs)

so all these functions work properly with BUDA apps.

-----------------

Also: the input and output templates for a BUDA app variant depend only on the variant, not on batches or jobs. So generate them when the variant is created, and store them in the variant dir, rather than generating them on batch submission.

Also: fix bug in downloading batch output as .zip; need to do zip -q

with a list of BUDA variant names (i.e. plan classes). Update as variants are added and deleted. This is used in project preferences for 'Use NVIDIA' type buttons.

feeder: the shared-mem segment has a list of resource types for which the project has work. Need to include BUDA variants also. Do this by scanning the 'buda_plan_classes' file (see above).

Note: this means that when the set of BUDA variants changes, we need to restart the project.

plan_class_spec.xml.sample: The 'cuda' class had a max compute capability of 200. Remove it.

with the workunit rather than the app version. This commit lays the groundwork for this.

put resource usage info in the <workunit> element.

original: Info about resource usage (GPU usage, #cpus) is stored in APP_VERSION. When we need this info for a RESULT, we look at rp->avp.

new: For BUDA apps, the info about the actual app (not the docker wrapper) comes with the workunit, not the app version. So create a new structure, RESOURCE_USAGE. APP_VERSION has one, WORKUNIT has one. So does RESULT; when we create the result we copy the struct either from the app version or (for BUDA jobs) the workunit. Then the code can just reference rp->resource_usage. Nice. This enables BUDA/GPU functionality with almost no additional complexity.

Add code to parse resource usage items in <workunit>.

Note: info about missing GPUs (or GPUs without needed libraries) is also stored in RESOURCE_USAGE.

AenBleidd · 2024-12-15T11:13:34Z

@davidpanderson, please fix build errors.

Copilot reviewed 20 out of 40 changed files in this pull request and generated no comments.

Files not reviewed (20)

client/app.cpp: Language not supported
client/app_config.cpp: Language not supported
client/app_control.cpp: Language not supported
client/app_start.cpp: Language not supported
client/client_state.cpp: Language not supported
client/client_types.cpp: Language not supported
client/client_types.h: Language not supported
client/coproc_sched.cpp: Language not supported
client/coproc_sched.h: Language not supported
client/cpu_sched.cpp: Language not supported
client/cs_scheduler.cpp: Language not supported
client/cs_statefile.cpp: Language not supported
client/log_flags.cpp: Language not supported
client/project.cpp: Language not supported
client/result.cpp: Language not supported
client/result.h: Language not supported
client/rr_sim.cpp: Language not supported
client/work_fetch.cpp: Language not supported
db/boinc_db_types.h: Language not supported
html/inc/app_types.inc: Language not supported

Copilot reviewed 22 out of 42 changed files in this pull request and generated no comments.

Files not reviewed (20)

client/app.cpp: Language not supported
client/app_config.cpp: Language not supported
client/app_control.cpp: Language not supported
client/app_start.cpp: Language not supported
client/client_state.cpp: Language not supported
client/client_types.cpp: Language not supported
client/client_types.h: Language not supported
client/coproc_sched.cpp: Language not supported
client/coproc_sched.h: Language not supported
client/cpu_sched.cpp: Language not supported
client/cs_scheduler.cpp: Language not supported
client/cs_statefile.cpp: Language not supported
client/log_flags.cpp: Language not supported
client/project.cpp: Language not supported
client/result.cpp: Language not supported
client/result.h: Language not supported
client/rr_sim.cpp: Language not supported
client/sim.cpp: Language not supported
client/sim_util.cpp: Language not supported
client/work_fetch.cpp: Language not supported

AenBleidd · 2024-12-15T20:36:09Z

@davidpanderson, unfortunately, still failing:

sched_shmem.cpp: In function ‘void get_buda_plan_classes(std::vector<std::__cxx11::basic_string<char> >&)’:
sched_shmem.cpp:116:28: error: cannot convert ‘FCGI_FILE*’ to ‘FILE*’
  116 |     while (fgets(buf, 256, f)) {
      |                            ^
      |                            |
      |                            FCGI_FILE*

also, please run python3 ci_tools/trailing_whitespaces_check.py . --fix to remove extra whitespaces at the end of the lines.
Thank you.

davidpanderson · 2024-12-15T21:24:45Z

possibly fixed

codecov · 2024-12-15T21:39:53Z

Codecov Report

Attention: Patch coverage is 0% with 105 lines in your changes missing coverage. Please review.

Project coverage is 10.70%. Comparing base (14a2a26) to head (8e8ccf8).
Report is 28 commits behind head on master.

Files with missing lines	Patch %	Lines
sched/sched_send.cpp	0.00%	54 Missing ⚠️
sched/sched_score.cpp	0.00%	12 Missing ⚠️
sched/sched_shmem.cpp	0.00%	12 Missing ⚠️
sched/sched_array.cpp	0.00%	6 Missing ⚠️
sched/sched_assign.cpp	0.00%	6 Missing ⚠️
sched/sched_locality.cpp	0.00%	6 Missing ⚠️
sched/sched_resend.cpp	0.00%	6 Missing ⚠️
sched/sched_customize.cpp	0.00%	1 Missing ⚠️
sched/sched_nci.cpp	0.00%	1 Missing ⚠️
sched/sched_version.cpp	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #5960      +/-   ##
============================================
- Coverage     10.73%   10.70%   -0.03%     
  Complexity     1068     1068              
============================================
  Files           280      280              
  Lines         36619    36709      +90     
  Branches       8489     8515      +26     
============================================
  Hits           3930     3930              
- Misses        32300    32390      +90     
  Partials        389      389

Files with missing lines	Coverage Δ
db/boinc_db_types.h	`0.00% <ø> (ø)`
sched/sched_send.h	`0.00% <ø> (ø)`
sched/sched_types.h	`0.00% <ø> (ø)`
sched/sched_customize.cpp	`0.00% <0.00%> (ø)`
sched/sched_nci.cpp	`0.00% <0.00%> (ø)`
sched/sched_version.cpp	`0.00% <0.00%> (ø)`
sched/sched_array.cpp	`0.00% <0.00%> (ø)`
sched/sched_assign.cpp	`0.00% <0.00%> (ø)`
sched/sched_locality.cpp	`0.00% <0.00%> (ø)`
sched/sched_resend.cpp	`0.00% <0.00%> (ø)`
... and 3 more

davidpanderson added 6 commits December 11, 2024 17:00

scheduler: if a job is BUDA, we need to return usage info (CPU, GPU)

b10221b

with the workunit rather than the app version. This commit lays the groundwork for this.

scheduler: for BUDA GPU jobs,

8151445

put resource usage info in the <workunit> element.

win build fixes

f5fc8b9

AenBleidd requested a review from Copilot December 15, 2024 15:50

Copilot AI reviewed Dec 15, 2024

AenBleidd added C: Client C: Server PR: Enhancement labels Dec 15, 2024

AenBleidd added this to the Client/Manager 8.2.0 milestone Dec 15, 2024

Fix client simulator build

1e77aeb

AenBleidd requested a review from Copilot December 15, 2024 20:30

Copilot AI reviewed Dec 15, 2024

davidpanderson added 2 commits December 15, 2024 13:22

scheduler: fix FCGI build

280c838

trailing white space

8e8ccf8

AenBleidd requested a review from Copilot December 15, 2024 21:38

Copilot AI reviewed Dec 15, 2024

AenBleidd approved these changes Dec 15, 2024

AenBleidd added the C: Server - Scheduler label Dec 15, 2024

AenBleidd merged commit 14f7546 into master Dec 15, 2024
152 of 153 checks passed

AenBleidd deleted the dpa_buda6 branch December 15, 2024 23:32
