
Bug in BlockManager's concurrency control #26

Open
kayousterhout opened this issue Jul 17, 2015 · 2 comments

@kayousterhout (Member)

BlockManager experiences a race condition in the following case:

Block A is being removed from memory by Thread 1 and saved to disk by Thread 2. The following sequence of events happens:
- Thread 1 calls removeBlockFromMemory, which acquires a lock on the block's BlockInfo.
- Before Thread 1 actually removes block A, Thread 2 calls updateBlockInfoOnWrite, which looks up the block's info in blockInfo and finds it still there.
- Thread 1 removes block A and releases its lock.
- Thread 2 acquires the lock on block A's BlockInfo. At this point, block A has been removed from blockInfo, but Thread 2 has no way of realizing that, so it updates the BlockInfo for a block that no longer exists.

This later results in an exception when the thread that wrote block A to disk tries to access its block info.
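
To make the interleaving concrete, here is a minimal sketch of the two code paths involved. The blockInfo map, the BlockInfo class, and the per-block locking below are simplified stand-ins, not the actual BlockManager code:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified stand-in for the real BlockInfo class.
class BlockInfo

object RaceSketch {
  val blockInfo = new ConcurrentHashMap[String, BlockInfo]()

  // Thread 1's path: remove the block while holding its BlockInfo lock.
  def removeBlockFromMemory(blockId: String): Unit = {
    val info = blockInfo.get(blockId)
    if (info != null) {
      info.synchronized {
        blockInfo.remove(blockId)
      }
    }
  }

  // Thread 2's path: look up the BlockInfo, then lock it and update it.
  def updateBlockInfoOnWrite(blockId: String): Unit = {
    val info = blockInfo.get(blockId) // Succeeds while Thread 1 still holds the lock.
    if (info != null) {
      info.synchronized {
        // By the time this lock is acquired, Thread 1 may already have
        // removed blockId from blockInfo. Any update performed here mutates
        // an orphaned BlockInfo, and a later read of the block's info fails.
      }
    }
  }
}
```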

This bug manifests when I run some of the new shuffle code in InputOutputMetricsSuite.

@christophercanel is this something you can fix?

@ccanel commented Jul 17, 2015

This is similar to the phenomenon documented in a comment in doGetLocal(). I can fix it, but there are a couple of options for how to go about it.

Option 1: Once a thread acquires a BlockInfo lock, it double-checks that the block is actually still present in blockInfo (sketched below).
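
For instance, reusing the simplified stand-ins from the sketch above (the method's signature and return type are assumptions for illustration, not the actual code):

```scala
def updateBlockInfoOnWrite(blockId: String): Boolean = {
  val info = blockInfo.get(blockId)
  if (info == null) {
    false // The block is already gone; don't write its info.
  } else {
    info.synchronized {
      // Double check: the block may have been removed between the lookup
      // above and acquiring the lock.
      if (blockInfo.get(blockId) ne info) {
        false // The BlockInfo is orphaned; give up rather than update it.
      } else {
        // ...safe to update the BlockInfo here...
        true
      }
    }
  }
}
```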

Option 2: Expand the locking functionality that the BlockInfo class provides to include a counter that tracks how many threads are waiting on its lock. The remove methods would then only remove a BlockInfo object if no other threads are waiting for its lock (see the sketch after this paragraph).
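
A rough sketch of option 2 (again hypothetical, not the real BlockInfo class): each thread bumps the counter before blocking on the lock, and a remove method skips removal while anyone is still waiting:

```scala
import java.util.concurrent.atomic.AtomicInteger

class CountingBlockInfo {
  // Threads that have looked this BlockInfo up and are waiting for
  // (or about to take) its lock.
  private val waiters = new AtomicInteger(0)

  def withLock[T](body: => T): T = {
    waiters.incrementAndGet()
    this.synchronized {
      waiters.decrementAndGet() // No longer waiting once the lock is held.
      body
    }
  }

  // A remove method would call this while holding the lock and keep the
  // BlockInfo in blockInfo if it returns true.
  def hasWaiters: Boolean = waiters.get() > 0
}
```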

I'm leaning towards option 1 because it's simpler. What do you think?

@kayousterhout (Member, Author)

Yeah 1 seems like a good approach here!

kayousterhout added a commit to kayousterhout/spark-1 that referenced this issue Jul 20, 2015
This commit changes the resultBlockId used by RddComputeMonotasks from
being an RDDBlockId to being a MonotaskResultBlockId. There's no reason for
this result to use an RDDBlockId (it's temporary data, not where the RDD
will be stored more permanently), and storing it with an RDDBlockId can
sometimes trigger a race condition in BlockManager between when the
monotask's result gets cleaned up and when a DiskWriteMonotask writes the
result (NetSys/spark-monotasks#26).
kayousterhout added a commit that referenced this issue Jul 22, 2015
This commit changes the resultBlockId used by RddComputeMonotasks from
being an RDDBlockId to being a MonotaskResultBlockId. There's no reason for
this result to use an RDDBlockId (it's temporary data, not where the RDD
will be stored more permanently), and storing it with an RDDBlockId can
sometimes trigger a race condition in BlockManager between when the
monotask's result gets cleaned up and when a DiskWriteMonotask writes the
result (#26).