Training on CPU produces division by zero error when computing progress value to display. #6615

Open
1 task done
sztejkat opened this issue Dec 28, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

During training, when progress is slow, the following error is reported:

    (....)
      File "***removed****/modules/training.py", line 717, in do_train
        timer_info = f"`{1.0/its:.2f}` s/it"
                         ~~~^~~~
    ZeroDivisionError: float division by zero
    (...)

Code inspection shows that the computation does not guard against its being zero before dividing by it.
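
To make the failure mode concrete, here is a minimal sketch of the same arithmetic outside the project. The function name `unguarded_timer_info` and its parameters are hypothetical, and the trigger (no step completed yet, so the step count is still 0 when the progress text is built) is an assumption that the report does not confirm.

```python
def unguarded_timer_info(current_steps: int, time_elapsed: float) -> str:
    """Hypothetical stand-in for the failing lines; not the project's code."""
    # Same arithmetic as in the traceback, with no zero check.
    its = current_steps / time_elapsed
    if its > 1:
        return f"`{its:.2f}` it/s"
    # When its == 0.0 this division raises ZeroDivisionError.
    return f"`{1.0 / its:.2f}` s/it"


try:
    # Assumed trigger: 30 seconds elapsed, but no training step finished yet.
    unguarded_timer_info(0, 30.0)
except ZeroDivisionError as exc:
    print(f"{type(exc).__name__}: {exc}")
```

On a slow CPU-only run, a single step can plausibly take longer than the interval between progress updates, which would explain why the error only shows up when progress is slow.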

As a workaround, I changed the lines starting at line 713 of the above file as follows:

    # starting at line 713
    its = tracked.current_steps / time_elapsed
    # Only build the timing text once a non-zero rate is available.
    if its > 0:
        if its > 1:
            timer_info = f"`{its:.2f}` it/s"
        else:
            timer_info = f"`{1.0/its:.2f}` s/it"

        total_time_estimate = (1.0 / its) * (tracked.max_steps)
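
The same guard can be sketched as a small standalone function that is easy to test in isolation. Everything here is illustrative: `format_progress`, its parameters, and the `n/a` placeholder are hypothetical names, not part of the project, and the extra check on `time_elapsed` is an assumption beyond what the workaround above covers.

```python
from typing import Optional, Tuple


def format_progress(
    current_steps: int, time_elapsed: float, max_steps: int
) -> Tuple[str, Optional[float]]:
    """Hypothetical helper mirroring the guarded logic; not the project's API."""
    # Assumption: also protect the first division in case no time has elapsed.
    its = current_steps / time_elapsed if time_elapsed > 0 else 0.0
    if its <= 0:
        # No rate available yet: return a placeholder instead of dividing by zero.
        return "`n/a`", None
    if its > 1:
        timer_info = f"`{its:.2f}` it/s"
    else:
        timer_info = f"`{1.0 / its:.2f}` s/it"
    # Seconds per step multiplied by the total number of steps.
    total_time_estimate = (1.0 / its) * max_steps
    return timer_info, total_time_estimate


print(format_progress(0, 12.5, 100))   # no step finished yet
print(format_progress(5, 10.0, 100))   # slower than one step per second
print(format_progress(30, 10.0, 100))  # faster than one step per second
```

Returning a placeholder rather than skipping the assignment means the caller always gets a defined `timer_info`, which avoids relying on a value left over from an earlier iteration.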

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Hard to reproduce reliably: it depends on the hardware, software setup, system load, and so on.

Screenshot

No response

Logs

See description.

System Info

Commit hash: 4d466d5c80eb83892b7dfb76fa4ab69efd6d6989
version tag: 2.0

OS: Linux (an older Ubuntu release), CPU only.

sztejkat added the bug label on Dec 28, 2024