client, server, web: enable BUDA GPU apps #5960
Conversation
If you make a variant of a BUDA app for a plan class (e.g. NVIDIA GPU with CUDA), this ensures that jobs submitted to that variant are sent only to capable hosts, and that the host usage and projected FLOPS are set correctly.

On the web side, we add a <plan_class> element to workunit.xml_doc. This gets sent to the scheduler.

On the scheduler this required some reorganization. As the scheduler scans jobs, it finds and caches a BEST_APP_VERSION for each app. This contains a HOST_USAGE. In the case of BUDA, the host usage depends on the workunit, not the app version. We might scan several BUDA jobs; they'll all use the same APP_VERSION, but they could have different plan classes and therefore different HOST_USAGE. So if we're looking at a job to send and the WU has a <plan_class> element, call app_plan() to check the host capability and get the host usage.

Change add_result_to_reply() so that it takes a HOST_USAGE& argument, rather than getting it from the BEST_APP_VERSION. We do this in several places:
- sched_array (old scheduling policy)
- sched_score (new scheduling policy)
- sched_locality (locality scheduling)
- sched_resend (resending lost jobs)
- sched_assign (assigned jobs)
so all these functions work properly with BUDA apps.

Also: the input and output templates for a BUDA app variant depend only on the variant, not on batches or jobs. So generate them when the variant is created, and store them in the variant dir, rather than generating them on batch submission.

Also: fix bug in downloading batch output as .zip; need to do zip -q
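A minimal sketch of the scheduler-side idea described above, using simplified stand-in types. The real HOST_USAGE, BEST_APP_VERSION, app_plan(), and add_result_to_reply() in the BOINC scheduler have larger, different signatures; this only illustrates the per-workunit selection of host usage.

    #include <string>

    // Simplified stand-ins for the scheduler types named above.
    struct HOST_USAGE { double avg_ncpus = 1; double gpu_usage = 0; double projected_flops = 0; };
    struct BEST_APP_VERSION { HOST_USAGE host_usage; };
    struct WU { std::string plan_class; };

    // Placeholder for app_plan(): check whether the host can handle the
    // plan class and, if so, fill in the usage info.
    bool app_plan_stub(const std::string& plan_class, HOST_USAGE& hu) {
        if (plan_class == "cuda") {      // hypothetical capability check
            hu.gpu_usage = 1;
            hu.avg_ncpus = 0.1;
            return true;
        }
        return false;
    }

    // Pick the HOST_USAGE for a job.  BUDA jobs carry a plan class in the
    // workunit, so usage is per-workunit; otherwise it comes from the
    // cached BEST_APP_VERSION as before.
    bool get_host_usage(const WU& wu, const BEST_APP_VERSION& bavp, HOST_USAGE& hu) {
        if (!wu.plan_class.empty()) {
            return app_plan_stub(wu.plan_class, hu);
        }
        hu = bavp.host_usage;
        return true;
    }

    // add_result_to_reply() then takes the chosen HOST_USAGE& explicitly
    // instead of reading it from BEST_APP_VERSION.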
Also: maintain a file 'buda_plan_classes' with a list of BUDA variant names (i.e. plan classes). Update it as variants are added and deleted. This is used in project preferences for 'Use NVIDIA' type buttons.

feeder: the shared-memory segment has a list of resource types for which the project has work. This needs to include BUDA variants also. Do this by scanning the 'buda_plan_classes' file (see above; an example is sketched below). Note: this means that when the set of BUDA variants changes, we need to restart the project.

plan_class_spec.xml.sample: the 'cuda' class had a max compute capability of 200. Remove it.
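For illustration, assuming 'buda_plan_classes' is simply a list of plan-class names, one per line (the exact format isn't shown in this PR), it might look like this; 'cuda' appears above, the other names are made-up examples:

    cuda
    opencl_nvidia
    opencl_amd

Since the feeder builds the shared-memory segment when it starts, changes to this list take effect only after the project is restarted, as noted above.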
For BUDA apps, resource usage and plan class info come with the workunit rather than the app version. This commit lays the groundwork for this.
put resource usage info in the <workunit> element.
Original: info about resource usage (GPU usage, #CPUs) is stored in APP_VERSION. When we need this info for a RESULT, we look at rp->avp.

New: for BUDA apps, the info about the actual app (not the Docker wrapper) comes with the workunit, not the app version. So create a new structure, RESOURCE_USAGE. APP_VERSION has one, WORKUNIT has one. So does RESULT; when we create the result we copy the struct either from the app version or (for BUDA jobs) from the workunit. Then the code can just reference rp->resource_usage. Nice. This enables BUDA/GPU functionality with almost no additional complexity.

Add code to parse resource usage items in <workunit>.

Note: info about missing GPUs (or GPUs without needed libraries) is also stored in RESOURCE_USAGE.
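A rough sketch of what such a RESOURCE_USAGE structure might contain; the field names below are guesses based on the description (CPU count, GPU usage, missing-GPU info), not the actual declaration.

    // Illustrative only; not the actual BOINC declaration.
    struct RESOURCE_USAGE {
        double avg_ncpus;               // #CPUs the job uses
        int rsc_type;                   // GPU type, if any
        double coproc_usage;            // fraction of that GPU used
        double flops;                   // projected FLOPS
        bool missing_coproc;            // GPU (or needed libraries) absent
        char missing_coproc_name[256];  // which coproc is missing
    };

    // APP_VERSION, WORKUNIT, and RESULT each contain one of these;
    // RESULT's copy is filled in when the result is created.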
@davidpanderson, please fix build errors.
Copilot reviewed 20 out of 40 changed files in this pull request and generated no comments.
Files not reviewed (20)
- client/app.cpp: Language not supported
- client/app_config.cpp: Language not supported
- client/app_control.cpp: Language not supported
- client/app_start.cpp: Language not supported
- client/client_state.cpp: Language not supported
- client/client_types.cpp: Language not supported
- client/client_types.h: Language not supported
- client/coproc_sched.cpp: Language not supported
- client/coproc_sched.h: Language not supported
- client/cpu_sched.cpp: Language not supported
- client/cs_scheduler.cpp: Language not supported
- client/cs_statefile.cpp: Language not supported
- client/log_flags.cpp: Language not supported
- client/project.cpp: Language not supported
- client/result.cpp: Language not supported
- client/result.h: Language not supported
- client/rr_sim.cpp: Language not supported
- client/work_fetch.cpp: Language not supported
- db/boinc_db_types.h: Language not supported
- html/inc/app_types.inc: Language not supported
Copilot reviewed 22 out of 42 changed files in this pull request and generated no comments.
Files not reviewed (20)
- client/app.cpp: Language not supported
- client/app_config.cpp: Language not supported
- client/app_control.cpp: Language not supported
- client/app_start.cpp: Language not supported
- client/client_state.cpp: Language not supported
- client/client_types.cpp: Language not supported
- client/client_types.h: Language not supported
- client/coproc_sched.cpp: Language not supported
- client/coproc_sched.h: Language not supported
- client/cpu_sched.cpp: Language not supported
- client/cs_scheduler.cpp: Language not supported
- client/cs_statefile.cpp: Language not supported
- client/log_flags.cpp: Language not supported
- client/project.cpp: Language not supported
- client/result.cpp: Language not supported
- client/result.h: Language not supported
- client/rr_sim.cpp: Language not supported
- client/sim.cpp: Language not supported
- client/sim_util.cpp: Language not supported
- client/work_fetch.cpp: Language not supported
@davidpanderson, unfortunately, still failing:
also, please run
possibly fixed
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5960 +/- ##
============================================
- Coverage 10.73% 10.70% -0.03%
Complexity 1068 1068
============================================
Files 280 280
Lines 36619 36709 +90
Branches 8489 8515 +26
============================================
Hits 3930 3930
- Misses 32300 32390 +90
Partials 389 389
With BUDA, the (BOINC) app version has just the docker wrapper.
Everything else (Dockerfile, executables) is part of the workunit,
along with the input files.
We want to support GPU applications in BUDA.
That implies that:
- the plan class is part of the workunit rather than the app version
- resource usage info is included in the workunit that's sent to and stored on the client.
This required changes to both scheduler and client
(and to a small extent web).
Fortunately I was able to keep the changes fairly simple.
DB:
When we create a BUDA workunit, we store its plan class
as a <plan_class> element of its xml_doc, where the scheduler can see it.
Scheduler:
If a workunit has a plan class,
call the plan class function to see if we can send it to the host
and if so to get the usage info.
Include the usage info in the <workunit> element in the scheduler reply.
Feeder:
It makes a list of GPU types the project can use;
this is used in scheduler replies.
This list now must reflect not only APP_VERSION plan classes,
but also BUDA app variants.
We do this using a file 'buda_plan_classes'
that's maintained by the web code.
Client:
A new struct RESOURCE_USAGE has GPU/CPU usage info.
APP_VERSION and WORKUNIT (for BUDA jobs) both have one.
The appropriate one is copied to RESULT when it's created.
Scheduling and work fetch code references this copy.
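A hypothetical sketch of that copy, reusing the RESOURCE_USAGE sketch above; the stand-in types, the has_resource_usage flag, and the function name are invented for illustration and are not the actual client code.

    // Minimal stand-ins for the client types; illustrative only.
    struct RESULT_T      { RESOURCE_USAGE resource_usage; };
    struct WORKUNIT_T    { bool has_resource_usage; RESOURCE_USAGE resource_usage; };
    struct APP_VERSION_T { RESOURCE_USAGE resource_usage; };

    // When a RESULT is created, copy resource usage from the workunit if it
    // carries any (BUDA job), else from the app version.  Scheduling and
    // work fetch code then read only rp->resource_usage.
    void init_result_resource_usage(RESULT_T* rp, WORKUNIT_T* wup, APP_VERSION_T* avp) {
        if (wup->has_resource_usage) {
            rp->resource_usage = wup->resource_usage;
        } else {
            rp->resource_usage = avp->resource_usage;
        }
    }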
Scheduler protocol:
now can include plan class and resource usage info
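For concreteness, a reply <workunit> for a BUDA GPU job might look roughly like this; the plan class comes from the workunit's xml_doc, while the tag names for the usage fields are illustrative rather than the exact protocol.

    <workunit>
        <name>buda_example_wu</name>
        <app_name>buda</app_name>
        <plan_class>cuda</plan_class>
        <!-- resource usage items; exact tag names are illustrative -->
        <avg_ncpus>0.1</avg_ncpus>
        <gpu_type>NVIDIA</gpu_type>
        <gpu_usage>1</gpu_usage>
        ...
    </workunit>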