client, server, web: enable BUDA GPU apps #5960

Merged
merged 9 commits into from
Dec 15, 2024

davidpanderson
Contributor

With BUDA, the (BOINC) app version has just the docker wrapper.
Everything else (Dockerfile, executables) is part of the workunit,
along with the input files.

We want to support GPU applications in BUDA.
That implies that:

  • The plan class (saying what GPU type and driver are needed)
    is part of the workunit rather than the app version
  • The resource usage (GPU, fraction of CPU) is part of
    the workunit that's sent to and stored on the client.

This required changes to both scheduler and client
(and to a small extent web).
Fortunately I was able to keep the changes fairly simple.

DB:
When we create a BUDA workunit, we store its plan class
as an element of its xml_doc, where the scheduler can see it.
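
As a rough illustration of the DB step, the sketch below pulls a <plan_class> value out of a workunit's xml_doc string. The helper name wu_plan_class and the plain substring search are assumptions made for this example; the scheduler itself presumably uses BOINC's own XML parsing code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the PR): return the text inside the first
// <plan_class>...</plan_class> element of xml_doc, or "" if there is none.
std::string wu_plan_class(const std::string& xml_doc) {
    const std::string open = "<plan_class>";
    const std::string close = "</plan_class>";
    std::string::size_type start = xml_doc.find(open);
    if (start == std::string::npos) return "";
    start += open.size();
    std::string::size_type end = xml_doc.find(close, start);
    if (end == std::string::npos) return "";
    return xml_doc.substr(start, end - start);
}
```

An empty return value would mean the workunit is an ordinary (non-BUDA) job whose plan class comes from the app version.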

Scheduler:
If a workunit has a plan class,
call the plan class function to see if we can send it to the host
and if so to get the usage info.
Include the usage info in the <workunit> element in the scheduler reply.

Feeder:
It makes a list of GPU types the project can use;
this is used in scheduler replies.
This list now must reflect not only APP_VERSION plan classes,
but also BUDA app variants.
We do this using a file 'buda_plan_classes'
that's maintained by the web code.
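
A minimal sketch of reading such a file, assuming it holds one plan class name per line (the actual file format is not stated here). The function name read_plan_classes is hypothetical, and it takes a std::istream so it can be fed from a file or from a string.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read plan class names, assumed to be one per line, skipping blank lines.
std::vector<std::string> read_plan_classes(std::istream& in) {
    std::vector<std::string> classes;
    std::string line;
    while (std::getline(in, line)) {
        // Trim trailing whitespace, including a '\r' from CRLF line endings.
        std::string::size_type end = line.find_last_not_of(" \t\r");
        if (end == std::string::npos) continue;  // blank line
        classes.push_back(line.substr(0, end + 1));
    }
    return classes;
}
```

In a feeder-like program, a std::ifstream opened on the 'buda_plan_classes' file could be passed in directly.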

Client:
A new struct RESOURCE_USAGE has GPU/CPU usage info.
APP_VERSION and WORKUNIT (for BUDA jobs) both have one.
The appropriate one is copied to RESULT when it's created.
Scheduling and work fetch code references this copy.

Scheduler protocol:
now can include plan class and resource usage info

If you make a variant of a BUDA app for a plan class
(e.g. NVIDIA GPU with CUDA)
this ensures that jobs submitted to that variant are sent
only to capable hosts,
and that the host usage and projected FLOPS are set correctly.

On the web side, we add a <plan_class> element to workunit.xml_doc.
This gets sent to the scheduler.

On the scheduler this required some reorganization.
As the scheduler scans jobs, it finds and caches
a BEST_APP_VERSION for each app.
This contains a HOST_USAGE.

In the case of BUDA, the host usage depends on the workunit,
not the app version.
We might scan several BUDA jobs;
they'll all use the same APP_VERSION,
but they could have different plan classes
and therefore different HOST_USAGE.

So if we're looking at a job to send,
and the WU has a <plan_class> element,
call app_plan() to check the host capability and get the host usage.

Change add_result_to_reply() so that it takes a HOST_USAGE& argument,
rather than getting it from the BEST_APP_VERSION.
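
The decision described above can be sketched roughly as follows. The structs are heavily simplified stand-ins for the scheduler types, and app_plan_stub and usage_for_job are hypothetical names used only for this example; the real app_plan() takes different arguments and consults the project's plan class specifications.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the scheduler types named above;
// the real structs have many more fields.
struct HOST_USAGE {
    double avg_ncpus = 1.0;
    double gpu_usage = 0.0;
};
struct BEST_APP_VERSION {
    HOST_USAGE host_usage;
};
struct WORKUNIT {
    std::string plan_class;  // empty for non-BUDA jobs
};

// Hypothetical stand-in for app_plan(): returns true and fills in hu
// if the host can run jobs of this plan class.
bool app_plan_stub(bool host_has_nvidia, const std::string& plan_class,
                   HOST_USAGE& hu) {
    if (plan_class == "cuda" && host_has_nvidia) {
        hu.avg_ncpus = 0.1;
        hu.gpu_usage = 1.0;
        return true;
    }
    return false;  // unknown plan class, or host lacks the GPU
}

// Pick the usage to hand to add_result_to_reply().
// Returns false if the job cannot be sent to this host.
bool usage_for_job(const WORKUNIT& wu, const BEST_APP_VERSION& bav,
                   bool host_has_nvidia, HOST_USAGE& hu) {
    if (wu.plan_class.empty()) {
        hu = bav.host_usage;  // ordinary job: cached per-app usage
        return true;
    }
    return app_plan_stub(host_has_nvidia, wu.plan_class, hu);
}
```

The point of the sketch is that the cached BEST_APP_VERSION is still used for ordinary jobs, while BUDA jobs get their usage computed per workunit.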

We do this in several places:
- sched_array (old scheduling policy)
- sched_score (new scheduling policy)
- sched_locality (locality scheduling)
- sched_resend (resending lost jobs)
- sched_assign (assigned jobs)
so all these functions work properly with BUDA apps.

-----------------

Also: the input and output templates for a BUDA app variant
depend only on the variant, not on batches or jobs.
So generate them when the variant is created,
and store them in the variant dir,
rather than generating them on batch submission.

Also: fix bug in downloading batch output as .zip;
need to do zip -q

web: maintain a file 'buda_plan_classes'
    with a list of BUDA variant names (i.e. plan classes).
    Update as variants are added and deleted.
    This is used in project preferences for 'Use NVIDIA' type buttons.

feeder: the shared-mem segment has a list of resource types
    for which the project has work.
    Need to include BUDA variants also.
    Do this by scanning the 'buda_plan_classes' file (see above)

    Note: this means that when the set of BUDA variants changes,
    we need to restart the project.

plan_class_spec.xml.sample:
    The 'cuda' class had a max compute capability of 200.
    Remove it.
with the workunit rather than the app version.
This commit lays the groundwork for this.
put resource usage info in the <workunit> element.
original:
Info about resource usage (GPU usage, #cpus) is stored in APP_VERSION.
When we need this info for a RESULT, we look at rp->avp

new:
For BUDA apps, the info about the actual app (not the docker wrapper)
comes with the workunit, not the app version.
So create a new structure, RESOURCE_USAGE.
APP_VERSION has one, WORKUNIT has one.
So does RESULT; when we create the result we copy the struct
either from the app version or (for BUDA jobs) the workunit.
Then the code can just reference rp->resource_usage.
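
A minimal sketch of that copy, using simplified stand-ins for the client structs. The field names inside RESOURCE_USAGE, the has_resource_usage flag and the init_resource_usage method are assumptions for illustration, not the PR's actual definitions.

```cpp
#include <cassert>

// Simplified stand-ins for the client structs; field names are illustrative.
struct RESOURCE_USAGE {
    double avg_ncpus = 1.0;
    double gpu_usage = 0.0;
};
struct APP_VERSION {
    RESOURCE_USAGE resource_usage;
};
struct WORKUNIT {
    bool has_resource_usage = false;  // hypothetical flag: true for BUDA jobs
    RESOURCE_USAGE resource_usage;
};
struct RESULT {
    APP_VERSION* avp = nullptr;
    WORKUNIT* wup = nullptr;
    RESOURCE_USAGE resource_usage;

    // Copy the usage once, at creation, so that scheduling and work-fetch
    // code can read rp->resource_usage without caring where it came from.
    void init_resource_usage() {
        resource_usage = wup->has_resource_usage
            ? wup->resource_usage
            : avp->resource_usage;
    }
};
```

Copying by value keeps the lookup in one place: callers never need to test whether a job is a BUDA job.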

Nice. This enables BUDA/GPU functionality with almost no additional complexity.

Add code to parse resource usage items in <workunit>

Note: info about missing GPUs (or GPUs without needed libraries)
is also stored in RESOURCE_USAGE.
@AenBleidd
Member

@davidpanderson, please fix build errors.

@AenBleidd AenBleidd requested a review from Copilot December 15, 2024 15:50

Copilot reviewed 20 out of 40 changed files in this pull request and generated no comments.

Files not reviewed (20)
  • client/app.cpp: Language not supported
  • client/app_config.cpp: Language not supported
  • client/app_control.cpp: Language not supported
  • client/app_start.cpp: Language not supported
  • client/client_state.cpp: Language not supported
  • client/client_types.cpp: Language not supported
  • client/client_types.h: Language not supported
  • client/coproc_sched.cpp: Language not supported
  • client/coproc_sched.h: Language not supported
  • client/cpu_sched.cpp: Language not supported
  • client/cs_scheduler.cpp: Language not supported
  • client/cs_statefile.cpp: Language not supported
  • client/log_flags.cpp: Language not supported
  • client/project.cpp: Language not supported
  • client/result.cpp: Language not supported
  • client/result.h: Language not supported
  • client/rr_sim.cpp: Language not supported
  • client/work_fetch.cpp: Language not supported
  • db/boinc_db_types.h: Language not supported
  • html/inc/app_types.inc: Language not supported
@AenBleidd AenBleidd requested a review from Copilot December 15, 2024 20:30

Copilot reviewed 22 out of 42 changed files in this pull request and generated no comments.

Files not reviewed (20)
  • client/app.cpp: Language not supported
  • client/app_config.cpp: Language not supported
  • client/app_control.cpp: Language not supported
  • client/app_start.cpp: Language not supported
  • client/client_state.cpp: Language not supported
  • client/client_types.cpp: Language not supported
  • client/client_types.h: Language not supported
  • client/coproc_sched.cpp: Language not supported
  • client/coproc_sched.h: Language not supported
  • client/cpu_sched.cpp: Language not supported
  • client/cs_scheduler.cpp: Language not supported
  • client/cs_statefile.cpp: Language not supported
  • client/log_flags.cpp: Language not supported
  • client/project.cpp: Language not supported
  • client/result.cpp: Language not supported
  • client/result.h: Language not supported
  • client/rr_sim.cpp: Language not supported
  • client/sim.cpp: Language not supported
  • client/sim_util.cpp: Language not supported
  • client/work_fetch.cpp: Language not supported
@AenBleidd
Member

@davidpanderson, unfortunately, still failing:

sched_shmem.cpp: In function ‘void get_buda_plan_classes(std::vector<std::__cxx11::basic_string<char> >&)’:
sched_shmem.cpp:116:28: error: cannot convert ‘FCGI_FILE*’ to ‘FILE*’
  116 |     while (fgets(buf, 256, f)) {
      |                            ^
      |                            |
      |                            FCGI_FILE*

also, please run python3 ci_tools/trailing_whitespaces_check.py . --fix to remove extra whitespaces at the end of the lines.
Thank you.

@davidpanderson
Contributor Author

possibly fixed

@AenBleidd AenBleidd requested a review from Copilot December 15, 2024 21:38

Copilot reviewed 22 out of 42 changed files in this pull request and generated no comments.

Files not reviewed (20)
  • client/app.cpp: Language not supported
  • client/app_config.cpp: Language not supported
  • client/app_control.cpp: Language not supported
  • client/app_start.cpp: Language not supported
  • client/client_state.cpp: Language not supported
  • client/client_types.cpp: Language not supported
  • client/client_types.h: Language not supported
  • client/coproc_sched.cpp: Language not supported
  • client/coproc_sched.h: Language not supported
  • client/cpu_sched.cpp: Language not supported
  • client/cs_scheduler.cpp: Language not supported
  • client/cs_statefile.cpp: Language not supported
  • client/log_flags.cpp: Language not supported
  • client/project.cpp: Language not supported
  • client/result.cpp: Language not supported
  • client/result.h: Language not supported
  • client/rr_sim.cpp: Language not supported
  • client/sim.cpp: Language not supported
  • client/sim_util.cpp: Language not supported
  • client/work_fetch.cpp: Language not supported
codecov bot commented Dec 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 105 lines in your changes missing coverage. Please review.

Project coverage is 10.70%. Comparing base (14a2a26) to head (8e8ccf8).
Report is 28 commits behind head on master.

Files with missing lines    Patch %   Lines
sched/sched_send.cpp        0.00%     54 Missing ⚠️
sched/sched_score.cpp       0.00%     12 Missing ⚠️
sched/sched_shmem.cpp       0.00%     12 Missing ⚠️
sched/sched_array.cpp       0.00%     6 Missing ⚠️
sched/sched_assign.cpp      0.00%     6 Missing ⚠️
sched/sched_locality.cpp    0.00%     6 Missing ⚠️
sched/sched_resend.cpp      0.00%     6 Missing ⚠️
sched/sched_customize.cpp   0.00%     1 Missing ⚠️
sched/sched_nci.cpp         0.00%     1 Missing ⚠️
sched/sched_version.cpp     0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5960      +/-   ##
============================================
- Coverage     10.73%   10.70%   -0.03%     
  Complexity     1068     1068              
============================================
  Files           280      280              
  Lines         36619    36709      +90     
  Branches       8489     8515      +26     
============================================
  Hits           3930     3930              
- Misses        32300    32390      +90     
  Partials        389      389              
Files with missing lines Coverage Δ
db/boinc_db_types.h 0.00% <ø> (ø)
sched/sched_send.h 0.00% <ø> (ø)
sched/sched_types.h 0.00% <ø> (ø)
sched/sched_customize.cpp 0.00% <0.00%> (ø)
sched/sched_nci.cpp 0.00% <0.00%> (ø)
sched/sched_version.cpp 0.00% <0.00%> (ø)
sched/sched_array.cpp 0.00% <0.00%> (ø)
sched/sched_assign.cpp 0.00% <0.00%> (ø)
sched/sched_locality.cpp 0.00% <0.00%> (ø)
sched/sched_resend.cpp 0.00% <0.00%> (ø)
... and 3 more

@AenBleidd AenBleidd merged commit 14f7546 into master Dec 15, 2024
152 of 153 checks passed
@AenBleidd AenBleidd deleted the dpa_buda6 branch December 15, 2024 23:32