Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: This Fiber is a zombie #131

Open
glasser opened this issue Jul 16, 2013 · 13 comments
Open

Error: This Fiber is a zombie #131

glasser opened this issue Jul 16, 2013 · 13 comments

Comments

@glasser
Copy link

glasser commented Jul 16, 2013

If I'm able to catch an exception with the text Error: this Fiber is a zombie, then that's a fibers bug, right? Not user error?

Working on finding a minimal reproduction, but while we're working on that, curious if we're correct that this is definitely a Fibers bug... (In 1.0.0; going to try 1.0.1 next.)

@glasser
Copy link
Author

glasser commented Jul 16, 2013

OK, I figured out what we're doing.

We're calling Fiber.yield in a fiber that we don't keep any reference to. So it eventually gets GCd. This causes stack unwinding, which throws from the Fiber.yield. But our Fiber.yield is wrapped (very indirectly) in a try/finally. And the finally clause itself has a f.wait in it.

The effect for us (on OSX at least) is that the main fiber actually gets re-run from the top!

Here's a reproduction:

var Fiber = require('fibers');

// Run this every so often so that GC and DestroyOrphans happens.
setInterval(function () {
  Fiber(function () { global.gc() }).run();
}, 1000);

Fiber(function () {
  console.log("TOP");
  try {
    Fiber.yield();
  } finally {
    console.log("finally");
    var f = Fiber.current;
    setTimeout(function () {
      console.log("re-running");
      f.run();
      console.log("re-ran");
    }, 10);
    console.log("yielding");
    Fiber.yield();
    console.log("yeld");
  }
}).run();
console.log("finishing main program");

For me, with Node 0.8.24 and Fibers 1.0.0, this prints:

$ node --expose-gc zombie.js 
TOP
finishing main program
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran
finally
yielding
re-running
TOP
re-ran

and so forth.

@glasser
Copy link
Author

glasser commented Jul 16, 2013

So I guess the short answer is, I shouldn't be letting a fiber that I care about get GCed. And in fact, the strategy of using try { Fiber.yield } finally { cleanup which involves waiting } is problematic.

But it certainly was surprising that the fiber got re-run from the top!

@glasser
Copy link
Author

glasser commented Jul 16, 2013

Also occurs with Node 0.10.13, Fibers 1.0.1.

@glasser
Copy link
Author

glasser commented Jul 16, 2013

At the very least, I think doing this should cause your code to crash, not re-run the fiber from the top. We would have figured out what was going on much earlier if that had happened.

@glasser
Copy link
Author

glasser commented Jul 16, 2013

Ah, OK. The second call to Fiber.yield immediately re-throws the "zombie" exception, so by the time the callback runs, the fiber has completely unwound. And you can run fibers multiple times... and that's what the f.run() does.

Now, for me, the f.run() was actually hidden inside a future.return(). I think it is very surprising that future.return() (or throw) can call fiber.run on a fiber where fiber.started is false (ie, a fiber that may have been running when the wait was called but is not any more). Can this happen for any reason other than that the Future.wait got terminated by a zombie exception? If that's the only way that this can happen, can cb in Future.wait check fiber.started and exit with a message otherwise?

@laverdet
Copy link
Owner

So I guess the short answer is, I shouldn't be letting a fiber that I care about get GCed. And in fact, the strategy of using try { Fiber.yield } finally { cleanup which involves waiting } is problematic.

Actually if you grab a reference to the fiber in the finally it should interrupt the unrolling process and you'll be just fine.

But it certainly was surprising that the fiber got re-run from the top!

Yeah the overloading of run() to both start and a resume a fiber was a questionable design decision in retrospect. I think the main issue that needs to be addressed here is the fact that future.return() can re-run a thrown fiber future (if I'm understanding correctly). I will have to look into this in more detail for sure.

@glasser
Copy link
Author

glasser commented Jul 17, 2013

Right, I agree that grabbing a reference to the Fiber in finally kept it from being GCed.

I don't think I was actually using FiberFuture, if that's what you meant. But yeah, I think the main issue was that a future.return() re-ran a fiber which had been waiting on that future but which got unwound. (Or well, a fiber which tried to wait on that future but when it called yield, it hit the "I'm trying to unwind, immediately rethrow zombie" check.)

AlexeyMK pushed a commit to AlexeyMK/meteor that referenced this issue Nov 13, 2013
@cmather
Copy link

cmather commented Apr 5, 2019

Hi @glasser and @laverdet,

I'm curious if you discovered a trick to prevent the zombie fiber issue?

This issue is periodically taking down our production app (not meteor). It does seem correlated to garbage collection, but I'm uncertain of the exact relationship.

I'm starting to dig into the fibers source to better understand what causes the zombie condition. If garbage collection is the source of the problem, I'm unclear why the fiber, or associated function, would be garbage collected, as it looks like the function is stored in a Persistent in fibers.cc which if I understand correctly, takes it out of garbage collection. And I'm also unclear on why storing a local variable to the fiber would somehow prevent this, since the local variable would be not reachable after a function returns, unless we store the fiber in a global structure of some sort.

I appreciate any insights, whatever they might be, as we begin our debug hunt!

Chris

Could this code cause a problem if the fiber isn't reachable on the heap?

function handleSomeRequest() {
  Fiber(function() { ... }).run();
}

@laverdet
Copy link
Owner

laverdet commented Apr 5, 2019

If the fiber isn't accessible on the heap then that means there is no way of the fiber ever waking up, because you can't ever call run on it again. In that case node-fibers will eventually wake up the fiber and force it to unwind by making Fiber.yield always throw. If you have a catch somewhere it would catch this exception.

Edit: There is nothing wrong with Fiber(function() { ... }).run(); as long as you do something with Fiber.current from inside the fiber. That's a pretty common pattern, actually. But if the fiber is waiting on an event that never comes it will eventually need to be unwound.

@cmather
Copy link

cmather commented Apr 5, 2019

Edit: Not so fast! Still debugging. Will post conclusions when we have them.

We resolved this issue, thanks to the clues you guys provided above. For posterity, in case it helps anyone else:

Reproduction:

> node --expose-gc index.js
var Fiber = require("fibers");

// Run this every so often so that GC and DestroyOrphans happens.
setInterval(function() {
  console.log("Running GC!");
  Fiber(function() {
    global.gc();
  }).run();
}, 1000);

Fiber(function() {
  try {
    Fiber.yield();
  } catch (e) {
    console.log(e);
  }
}).run();

In the C++ class backing Fiber, the fiber is stored in a Persistent<Object> handle which removes the object from the garbage collection process. However, in the constructor function for Fiber the SetWeak method is called on the handle. This registers a callback function that will be called when the v8 garbage collector runs when there are no more references to the object (other than the weak handle). This callback function then adds the fiber to the orphaned fibers array where it is turned into a zombie next time any fiber runs. Without a try/catch block, this zombie error seems to be swallowed.

The fix, as @glasser mentions above, is to add a local reference to the fiber. But the trick is to store this reference inside the fiber callback function itself. Otherwise we'd have to somehow keep track when the fiber has finished running (a meta gc!).

Fiber(function() {
  // add a ref to the current fiber, tricking the v8 gc so that the weak handle callback isn't called.
  var f = Fiber.current;
  try {
    Fiber.yield();
  } catch (e) {
    console.log(e);
  }
}).run();

@laverdet
Copy link
Owner

laverdet commented Apr 5, 2019

I must warn you that your fix here replaces an error with a memory leak.

@cmather
Copy link

cmather commented Apr 5, 2019

@laverdet - Thanks for your response and this extra hint. Your explanation makes sense. It looks like we must still be missing something in our app. If we didn't have a handle to the fiber somewhere then yeah we wouldn't be able to call run() on it to resume.

Edit: I hadn't seen your response before I wrote mine.

@ailabs-software
Copy link

ailabs-software commented Jul 29, 2020

For anyone else reaching this issue from Google and worrying that something is wrong with node-fibers:

It may very well be a bug in your code or another library where a callback is never invoked which that fibre is waiting on.

I had this error as well, "This Fiber is a zombie."

It turned out to be because of a bug in my application code.

In my case, I had implemented a re-entrant lock (mutex) which allows first caller to acquire lock, then requires successive calls to wait until the first caller calls release(). The problem ended up being that my rather hastily written lock implementation only stored the first callback for waiting (blocked) fibres! In fact, it wasn't even storing a reference to successive waiting (yielded) fibres.

So, naturally the garbage collector is GCing the fibre as there are no references to it, in my case.

If you are dealing with a web application, look for requests which are timing out or finishing with an error.
(if you hold a reference to each fibre created, I found the request affected by the bug will never complete -- until the heap is exhausted!)

Hope this helps someone stop wasting time like I did worrying about a bug in node-fibers before checking their own code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants