
PMIx based Debugger Operations

Ralph Castain edited this page Mar 1, 2020 · 1 revision

UNDER CONSTRUCTION

Please note that debugger operations are the subject of an RFC currently under development. The description on this page should therefore be considered a "draft" at this time and is provided to help stimulate discussion.

Generic explanation of debugger operations. Explain direct vs. indirect launch (show mpirun daemons under system daemons to make it clearer) and how tools find and connect to servers. Explain how IO forwarding works, where queries from the debugger master tool are routed, how notification of job and debugger daemon termination is delivered, and the stop-on-exec, stop-in-init, and stop-in-app options.

Walk through the attach-to-running-job procedure. There are two options: the tool can connect to the designated starter (pid, jobid, nspace) and request a daemon launch, if supported; alternatively, it can send a spawn request to the system server asking it to colocate daemons with the specified job.

Regardless of how they are established, once started the daemons use PMIx_Query to obtain information on the procs assigned to them for debugging and to get the local (or global) proctable. They can also get their relative rank on the node and, if there are multiple daemons on the node, use it to determine which local proc(s) they will debug. When ready, the daemons can release the proc(s) and begin debugging operations.
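The per-node assignment described above can be sketched in plain C. This is only an illustration, not PMIx code: `proc_range_t` and `assign_local_procs` are hypothetical names, and the block partitioning shown is one possible policy assumed here. The inputs (number of local procs, number of daemons on the node, and the daemon's node-relative rank) are assumed to have been obtained already through the queries mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of PMIx): half-open range
 * [first, last) of indices into the node's local proctable that
 * one debugger daemon will take responsibility for. */
typedef struct {
    size_t first;
    size_t last;
} proc_range_t;

/* Block-partition nprocs local application procs across ndaemons
 * debugger daemons on the same node. daemon_rank is the daemon's
 * node-relative rank (0-based). The first (nprocs % ndaemons)
 * daemons each take one extra proc, so every proc is covered
 * exactly once. */
proc_range_t assign_local_procs(size_t nprocs, size_t ndaemons,
                                size_t daemon_rank)
{
    assert(ndaemons > 0 && daemon_rank < ndaemons);

    size_t base  = nprocs / ndaemons;   /* procs every daemon gets   */
    size_t extra = nprocs % ndaemons;   /* leftover procs to spread  */

    proc_range_t r;
    r.first = daemon_rank * base + (daemon_rank < extra ? daemon_rank : extra);
    r.last  = r.first + base + (daemon_rank < extra ? 1 : 0);
    return r;
}
```

For example, with five local procs and two daemons on the node, the daemon with relative rank 0 would take proctable entries 0 through 2 and the daemon with relative rank 1 would take entries 3 and 4. A daemon whose range is empty (more daemons than procs) has nothing to debug on that node.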

Daemons can also query the application proc itself. Show the flow of a query to the local PMIx server, then down to the client proc, and the return of the response. This can be used to collect metadata on application operations, etc. Apps can register callback functions to reply to queries at the application level, or supporting libraries (e.g., MPI) can do so. Explain that this connection path is used to minimize the impact on apps.
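As a rough mental model of the query path described above (daemon to local server, server to client proc, reply back), the following toy C sketch may help. It does not use the PMIx API: every name in it (`client_proc_t`, `client_register_handler`, `server_route_query`, `report_pending_ops`, and the key `"example.pending_ops"`) is hypothetical, the "server" is reduced to a function call in the same process, and the reply value is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_HANDLERS 8

/* Hypothetical callback type: write a reply of at most len bytes
 * into reply and return 0 on success. */
typedef int (*query_handler_fn)(const char *key, char *reply, size_t len);

typedef struct {
    const char      *key;
    query_handler_fn fn;
} handler_entry_t;

/* Toy stand-in for an application (client) proc and the handlers
 * that it, or a library it uses, has registered. */
typedef struct {
    handler_entry_t handlers[MAX_HANDLERS];
    size_t          nhandlers;
} client_proc_t;

/* Called by the app or a supporting library to answer a given key. */
int client_register_handler(client_proc_t *c, const char *key,
                            query_handler_fn fn)
{
    assert(c != NULL && key != NULL && fn != NULL);
    if (c->nhandlers >= MAX_HANDLERS) {
        return -1;                      /* table full */
    }
    c->handlers[c->nhandlers].key = key;
    c->handlers[c->nhandlers].fn  = fn;
    c->nhandlers++;
    return 0;
}

/* Toy stand-in for the local server: passes the daemon's query down
 * to the client proc and relays the handler's reply back up. */
int server_route_query(client_proc_t *c, const char *key,
                       char *reply, size_t len)
{
    for (size_t i = 0; i < c->nhandlers; i++) {
        if (strcmp(c->handlers[i].key, key) == 0) {
            return c->handlers[i].fn(key, reply, len);
        }
    }
    return -1;                          /* nobody answers this key */
}

/* Example handler that a library might register. The value 3 is a
 * placeholder, not real data. */
int report_pending_ops(const char *key, char *reply, size_t len)
{
    (void)key;
    snprintf(reply, len, "pending_sends=%d", 3);
    return 0;
}
```

In this model the daemon never touches the application proc directly; it only calls `server_route_query`, and the application does work only when a handler it registered is invoked, which mirrors the goal of minimizing impact on apps.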

When debugging is complete, the debugger daemons can terminate, and the debugger master is notified.
