GC is too lazy, sometimes #54937

Closed
sprig opened this issue Jun 26, 2024 · 10 comments
Labels
GC Garbage collector

Comments

@sprig

sprig commented Jun 26, 2024

Hi!

I'm experiencing issues, in several different contexts, where the GC doesn't run, eventually exhausting the available memory; often the OOM killer ends up killing the Julia process even though the GC could have freed a lot of memory. I have seen this both in numerical loops that happen to allocate memory and in simple data processing.

MWE:

d="/data"
fl=readdir(d,join=true)
julia> function search(c, fn)
           s = readchomp(fn)
           res = contains(s, r"TEST")
           put!(c, (res, fn))
           @info "finished"
       end
search (generic function with 1 method)

julia> c = Channel(spawn=true) do c
           for (r, f) in c
               if r
                   @show f
               end
           end
       end
Channel{Any}(0) (empty)

julia> t = @time Threads.@spawn begin
           foreach(fl) do fn
               Threads.@spawn search(c, fn)
           end
       end

f = "/data/somefile.csv"
[ Info: finished
...

Focusing on resident memory:
In a fresh Julia session, Julia uses about 200M:

[Screenshot 2024-06-25 at 19:15:32]

After running the code above with /data containing roughly 25k files of roughly equal size, totaling ~10GB, the Julia process is consuming almost 90G (which seems excessive even if everything were actually loaded at once):

[Screenshot 2024-06-25 at 18:38:48]

After GC.gc() we're back to using 300M:

[Screenshot 2024-06-25 at 19:14:51]

Finally, after GC.gc(true), we're up to 400M, although virtual memory usage decreased.
[Screenshot 2024-06-25 at 19:15:19]

If I add a Base.Event lock inside search, memory usage does not begin to increase until the Event is notified, so this is not due to creating excessive tasks, for example. Alternatively, if I replace the @spawn loop with a simple map or even a Distributed.pmap (without using a channel to communicate back results), memory remains bounded. However, if I do the processing via Distributed.@spawnat :any search(channel, fn) in a loop/map, memory grows again, and again it is released with an @everywhere GC.gc().
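For reference, a minimal sketch of the Event-gated variant described above (assuming the same c and fl as in the MWE; gated_search is just an illustrative name):

ev = Base.Event()

function gated_search(ev, c, fn)
    wait(ev)                      # block until the event is notified
    s = readchomp(fn)
    put!(c, (contains(s, r"TEST"), fn))
end

foreach(fl) do fn
    Threads.@spawn gated_search(ev, c, fn)
end

notify(ev)                        # memory only starts to grow after this point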

A different context where I experienced this, completely unrelated to channels (although it was run in parallel using pmap), was code that allocated new matrices in a loop for the minors of a larger matrix. A manual GC.gc() at the end of every iteration kept usage constant instead of letting it grow unbounded (until I refactored the code to avoid allocations altogether).
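An illustration of that workaround (not the actual code; the per-minor work here is just a stand-in):

using LinearAlgebra

A = rand(500, 500)
for i in 1:size(A, 1)
    keep = setdiff(1:size(A, 1), i)   # drop row and column i
    M = A[keep, keep]                 # allocates a fresh minor every iteration
    det(M)                            # stand-in for the real per-minor work
    GC.gc()                           # manual collection keeps resident memory flat
end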


First experienced here:

> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKGDIR = /opt/julia
  JULIA_DEPOT_PATH = /opt/julia

This is running in podman on Linux with the following image: https://hub.docker.com/layers/jupyter/datascience-notebook/x86_64-julia-1.9.3/images/sha256-98c2b44b4e44e044a8670ac27b201704e5222f8a7d748eb7cfd94a2cdad52e7d

As a test environment, I also ran this in a fresh Ubuntu podman container: I downloaded the Julia tarball from julialang.org, unpacked it inside /opt/, and ran it by path: root@fc72b99c6484:/opt# julia-1.10.4/bin/julia -t 8

> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)

On the host:

$ uname -a
Linux host 6.6.28-gentoo-dist-hardened #1 SMP PREEMPT_DYNAMIC Tue May  7 06:37:22 -00 2024 x86_64 Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz GenuineIntel GNU/Linux
@oscardssmith
Member

Julia 1.10 generally does a lot better here. Can you try upgrading to 1.10.4?

@sprig
Author

sprig commented Jun 26, 2024

Thanks for the reply!

Julia 1.10 generally does a lot better here. Can you try upgrading to 1.10.4?

Sorry, you may have missed it, but I mentioned that this also fails in a 1.10.4 test environment - see the second-to-last quote block.

However, I did some more testing, and on Windows 11 I was initially not able to reproduce this with the following version:
Edit: I had a bug in the Windows code, since I also had to use ZipFile to unpack the data on the fly. I was able to reproduce on Windows 11 with both 1.10.2 and 1.10.4. Here's the versioninfo from 1.10.2 (currently the default version on that machine). I can unpack the data to avoid ZipFile if necessary for some tests, but given that the MWE reproduces on Linux, I've avoided this for now.

> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)

installed via juliaup, itself installed via the Windows app store. I'll check whether this also reproduces on 1.10.4. EDIT: Yes - see above.


Edit:

I tested a bit more on Windows and was able to reproduce with 1.10.2, 1.10.4, 1.11, as well as nightly. Here's the versioninfo from nightly:

> versioninfo()
Julia Version 1.12.0-DEV.783
Commit 07f7efd835 (2024-06-25 17:20 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 8 GC (on 8 virtual cores)

On Windows, I eventually start seeing a string of messages like

┌ Error: OutOfMemoryError()
└ @ Main REPL[7]:9

then

Internal error: during type inference of
string(ReadOnlyMemoryError)
Encountered unexpected error in runtime:
ReadOnlyMemoryError()

and finally both Julia and the terminal window it was running in are killed.

@sprig
Author

sprig commented Jun 26, 2024

As a side note, I'm not entirely sure why 10G of strings should take 90G of memory, regardless of GC.

@nsajko nsajko added the GC Garbage collector label Jun 26, 2024
@nsajko
Contributor

nsajko commented Jun 26, 2024

In my experience, if you want the GC to free all memory available for freeing, GC.gc should be called multiple times in a loop, not just once.
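Something along these lines (a minimal sketch):

for _ in 1:3
    GC.gc(true)   # `true` requests a full collection
end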

@gbaraldi
Member

Can you run GC.enable_logging()?
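For example (this just toggles the built-in GC log on and off):

GC.enable_logging(true)    # print a line for each collection (pause time, memory collected)
# ... run the reproducer ...
GC.enable_logging(false)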

@sprig
Author

sprig commented Jun 28, 2024

@gbaraldi
I did this on Windows, with 1.10.2 (as above).

Initially I got a bunch of messages like so:
[Screenshot 2024-06-27 at 16:52:54]

This is while Julia is maxing out the memory available to the system. Eventually Windows showed me this:

[Screenshot 2024-06-27 at 16:54:40]

And shortly afterwards the julia terminal was forcibly closed.

I'll try running this again a bit later in a terminal that keeps logs of output to see whether anything more interesting is shown by the GC debug output right before julia dies.

@gbaraldi
Member

Well, it seems that the memory being allocated is not being freed, not that the GC isn't running. But it deserves a further look anyway.

@sprig
Author

sprig commented Jun 29, 2024

Well, it seems that the memory being allocated is not being freed, not that the GC isn't running. But it deserves a further look anyway.

Agreed. I guess it's more of a memory leak than the GC not running.
I did some further testing, and even when adding GC.gc() to the end of the search function above, I still run out of memory. Additionally, I ran this in tmux so that I could capture the full output; this is how the process ends (still on Windows):

GC: pause 24.19ms. collected 138.104522MB. full

GC: pause 26.80ms. collected 86.853422MB. incr
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
julia> ┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
"Z:/data/file1345" => true┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
Internal error: encountered unexpected error in runtime:
ReadOnlyMemoryError()
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9

My current suspicion is that having an excessive number of tasks fills up memory during context switching. I will try this again with a feeder task/channel and a finite number of consumer tasks.

@sprig
Author

sprig commented Jun 29, 2024

Well, using a feeder task with a fixed number of workers keeps the Julia process using a fixed amount of memory. Sorry for the noise.
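For the record, a minimal sketch of that pattern (buffer size, worker count, and variable names are illustrative; c, fl, and search are as in the MWE above):

jobs = Channel{String}(32) do jobs
    foreach(fn -> put!(jobs, fn), fl)   # feeder: the channel closes when this task finishes
end

workers = map(1:Threads.nthreads()) do _
    Threads.@spawn for fn in jobs       # each worker pulls file names until the channel is drained
        search(c, fn)
    end
end

foreach(wait, workers)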

@sprig sprig closed this as completed Jun 29, 2024
@PallHaraldsson
Contributor

I guess it's more of a memory leak than the GC not running.

You seem to understand the issue and the fix, but it's unclear whether others will, and they might blame the GC. Can and should something be documented about this that isn't currently?
