add ggml_backend_sched_dump_dot #10825

Open
wants to merge 5 commits into
base: master

foldl (Contributor) commented Dec 14, 2024

This PR adds a DOT dump function to the scheduler (ggml_backend_sched). Compared to the existing ggml_graph_dump_dot, this function:

  • Color-codes the backend and buffer type (buft) into the background of each node.
  • Shows graph splits.
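
To make the idea concrete, here is a minimal sketch of a DOT dump that draws one cluster per split and fills each node with a color chosen by its backend index. It is not the code in this PR: DemoNode, demo_dump_dot, and the palette parameter are hypothetical names invented for illustration, and no ggml types are used.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a scheduled node (not a ggml type).
struct DemoNode {
    std::string      name;    // label shown in the rendered graph
    int              backend; // index of the backend assigned to this node
    int              split;   // index of the split this node belongs to
    std::vector<int> srcs;    // indices of the source nodes in the node list
};

// Emit DOT text: one "cluster" subgraph per split, fill color picked by backend index.
std::string demo_dump_dot(const std::vector<DemoNode> & nodes,
                          const std::vector<std::string> & palette) {
    assert(!palette.empty());
    std::ostringstream out;
    out << "digraph G {\n  node [shape=box, style=filled];\n";

    // The number of splits is the largest split index plus one.
    int n_splits = 0;
    for (const DemoNode & n : nodes) {
        if (n.split + 1 > n_splits) n_splits = n.split + 1;
    }

    // Graphviz draws a box around any subgraph whose name starts with "cluster".
    for (int s = 0; s < n_splits; s++) {
        out << "  subgraph cluster_" << s << " {\n    label=\"split " << s << "\";\n";
        for (size_t i = 0; i < nodes.size(); i++) {
            if (nodes[i].split != s) continue;
            assert(nodes[i].backend >= 0);
            const std::string & color = palette[static_cast<size_t>(nodes[i].backend) % palette.size()];
            out << "    n" << i << " [label=\"" << nodes[i].name
                << "\", fillcolor=\"" << color << "\"];\n";
        }
        out << "  }\n";
    }

    // Edges go from each source node to the node that consumes it.
    for (size_t i = 0; i < nodes.size(); i++) {
        for (int src : nodes[i].srcs) {
            out << "  n" << src << " -> n" << i << ";\n";
        }
    }
    out << "}\n";
    return out.str();
}
```

The returned text can be written to a file and rendered with Graphviz (for example, dot -Tpng). A node whose color differs from its neighbors inside one cluster is the kind of anomaly that stands out visually.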

A demo of its usefulness: in this graph, something abnormal is visible at a glance. It is caused by ggml_rms_norm_inplace for the input layer norm, which is probably an error in the scheduler.

image

Note:

  1. ggml_graph_get_grad is also fixed to handle the case where cgraph->grads is NULL.
  2. Define GGML_DOT_FULL_COLOR for full color; otherwise, a color scheme is used.
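
The NULL fix in note 1 amounts to a guard before indexing the gradient array. The sketch below shows the shape of such a guard on hypothetical stand-in types (DemoTensor, DemoGraph, demo_get_grad); it does not reproduce the actual ggml_graph_get_grad, whose lookup logic is different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; these are not ggml types.
struct DemoTensor { int id; };

struct DemoGraph {
    std::vector<const DemoTensor *> nodes; // nodes of the graph
    DemoTensor ** grads;                   // parallel to `nodes`, or nullptr when
                                           // the graph carries no gradients
};

// Return the gradient tensor for `node`, or nullptr if there is none.
DemoTensor * demo_get_grad(const DemoGraph & g, const DemoTensor * node) {
    assert(node != nullptr);
    if (g.grads == nullptr) {
        return nullptr; // the guard: never index into a missing gradient array
    }
    for (size_t i = 0; i < g.nodes.size(); i++) {
        if (g.nodes[i] == node) {
            return g.grads[i];
        }
    }
    return nullptr; // node is not part of this graph
}
```

A dump function can then call the getter on any graph, including inference-only graphs, without checking for gradients first.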

PS: Someone might say that the dumped graph is too large to be rendered. In my case (chatllm.cpp), I use --layer_spec to load only 2 or 3 layers.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 14, 2024
slaren (Collaborator) commented Dec 15, 2024

The discrepancy between buffer type and backend may have several causes. ggml_backend_sched ignores non-executable view ops, so they may end up with arbitrary assignments that are not relevant while running the graph. Copies of tensors in the splits do not have assignments in the hash table at all, since they are implicitly allocated in the split backend.

Fundamentally, ggml_backend_sched does not work on the graph directly; it works on the list of nodes, which is the topologically ordered representation of the graph. Thus, I do not think that trying to reason about it by representing what it does as a graph is going to be useful. The best way to understand what ggml_backend_sched is doing is to look at the list of nodes, which can be obtained by setting the environment variable GGML_SCHED_DEBUG to 2. I am afraid that this is just going to result in misunderstandings about what is actually happening in ggml_backend_sched, and lead to false bug reports like the supposed problem that you found.
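
As a rough illustration of the "list of nodes" view, the sketch below walks a flat list of per-node backend indices and groups consecutive nodes with the same backend into one split. This is a deliberate simplification under stated assumptions: DemoSplit and demo_make_splits are hypothetical names, and the real scheduler does considerably more (for example, the view ops and tensor copies mentioned above are not modeled).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical split record: a contiguous run of nodes on one backend.
struct DemoSplit {
    int    backend; // backend index shared by every node in the split
    size_t i_start; // first node index (inclusive)
    size_t i_end;   // one past the last node index (exclusive)
};

// Group a topologically ordered node list into maximal same-backend runs.
std::vector<DemoSplit> demo_make_splits(const std::vector<int> & node_backend) {
    std::vector<DemoSplit> splits;
    for (size_t i = 0; i < node_backend.size(); i++) {
        assert(node_backend[i] >= 0);
        if (splits.empty() || splits.back().backend != node_backend[i]) {
            // The backend changed (or this is the first node): open a new split.
            splits.push_back({node_backend[i], i, i + 1});
        } else {
            // Same backend as the previous node: extend the current split.
            splits.back().i_end = i + 1;
        }
    }
    return splits;
}
```

With the input {0, 0, 1, 1, 1, 0} this yields three splits covering node ranges [0, 2), [2, 5), and [5, 6). The split boundaries depend only on the order of the list, not on the edges of the graph.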

foldl (Contributor, Author) commented Dec 15, 2024

Update:

  1. Color is deduced from index;
  2. Hue is used for GGML_DOT_FULL_COLOR.
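
One way to derive a color from an index is to vary only the hue and emit a Graphviz HSV color string. The sketch below is an assumption about how such a mapping could look, not the mapping used in this PR: demo_index_to_hsv, the golden-ratio step, and the default saturation and value are all choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Map a non-negative index to a Graphviz HSV color string "H S V",
// with every component in [0, 1]. Only the hue depends on the index.
std::string demo_index_to_hsv(int index, double saturation = 0.35, double value = 1.0) {
    assert(index >= 0);
    // Stepping the hue by the golden-ratio conjugate spreads nearby
    // indices far apart on the color wheel (an illustrative choice).
    const double golden = 0.618033988749895;
    const double hue = std::fmod(index * golden, 1.0);

    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.3f %.3f %.3f", hue, saturation, value);
    return buf;
}
```

The result can be used directly as a fillcolor attribute value, for example fillcolor="0.618 0.350 1.000". A low saturation keeps black node labels readable on the filled background.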

foldl (Contributor, Author) commented Dec 15, 2024

... lead to false bug reports like the supposed problem that you found.

The issue is one of the following:

  • the ggml API ggml_backend_sched_set_tensor_backend is not used correctly, or
  • xxx_inplace, view, or reshape operators are not handled properly.

I used this to identify the issue, updated my code (chatllm.cpp), and things worked.

The proposed function is not meant for debugging the scheduler, but for visualizing graph splits and backends. It may also help with debugging.
