
Bug in BlockManager's concurrency control #26

Open
kayousterhout opened this issue Jul 17, 2015 · 2 comments

@kayousterhout (Member)

BlockManager experiences a race condition in the following case:

Block A is being removed from memory by Thread 1 and saved to disk by Thread 2. The following sequence of events happens:
- Thread 1 calls removeBlockFromMemory, which acquires a lock on the block's BlockInfo.
- Before Thread 1 actually removes block A, Thread 2 calls updateBlockInfoOnWrite, which looks up the block's info in blockInfo and finds it still there.
- Thread 1 removes block A and releases its lock.
- Thread 2 acquires the lock on block A's BlockInfo. At this point, block A has been removed from blockInfo, but Thread 2 has no way of realizing that, so it updates the BlockInfo for a block that no longer exists.

This later results in an exception when the thread that wrote block A to disk tries to access its block info.
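
To make the interleaving concrete, here is a minimal sketch of the two code paths involved. The blockInfo map, the BlockInfo class, and the per-block locking below are simplified stand-ins, not the actual BlockManager code:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified stand-in for the real BlockInfo class.
class BlockInfo

object RaceSketch {
  val blockInfo = new ConcurrentHashMap[String, BlockInfo]()

  // Thread 1's path: remove the block while holding its BlockInfo lock.
  def removeBlockFromMemory(blockId: String): Unit = {
    val info = blockInfo.get(blockId)
    if (info != null) {
      info.synchronized {
        blockInfo.remove(blockId)
      }
    }
  }

  // Thread 2's path: look up the BlockInfo, then lock it and update it.
  def updateBlockInfoOnWrite(blockId: String): Unit = {
    val info = blockInfo.get(blockId) // Succeeds while Thread 1 still holds the lock.
    if (info != null) {
      info.synchronized {
        // By the time this lock is acquired, Thread 1 may already have
        // removed blockId from blockInfo. Any update performed here mutates
        // an orphaned BlockInfo, and a later read of the block's info fails.
      }
    }
  }
}
```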

This bug manifests when I run some of the new shuffle code in InputOutputMetricsSuite.

@christophercanel is this something you can fix?

@ccanel commented Jul 17, 2015

This is similar to the phenomenon documented in a comment in doGetLocal(). I can fix it, but there are a couple of options for how to go about it.

Option 1: Once a thread acquires a BlockInfo lock, it double-checks that the block is actually still present in blockInfo (sketched below).
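
For instance, reusing the simplified stand-ins from the sketch above (the method's signature and return type are assumptions for illustration, not the actual code):

```scala
def updateBlockInfoOnWrite(blockId: String): Boolean = {
  val info = blockInfo.get(blockId)
  if (info == null) {
    false // The block is already gone; don't write its info.
  } else {
    info.synchronized {
      // Double check: the block may have been removed between the lookup
      // above and acquiring the lock.
      if (blockInfo.get(blockId) ne info) {
        false // The BlockInfo is orphaned; give up rather than update it.
      } else {
        // ...safe to update the BlockInfo here...
        true
      }
    }
  }
}
```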

Option 2: Expand the locking functionality that the BlockInfo class provides to include a counter that tracks how many threads are waiting on its lock. The remove methods would then only remove a BlockInfo object if no other threads are waiting for its lock (see the sketch after this paragraph).
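
A rough sketch of option 2 (again hypothetical, not the real BlockInfo class): each thread bumps the counter before blocking on the lock, and a remove method skips removal while anyone is still waiting:

```scala
import java.util.concurrent.atomic.AtomicInteger

class CountingBlockInfo {
  // Threads that have looked this BlockInfo up and are waiting for
  // (or about to take) its lock.
  private val waiters = new AtomicInteger(0)

  def withLock[T](body: => T): T = {
    waiters.incrementAndGet()
    this.synchronized {
      waiters.decrementAndGet() // No longer waiting once the lock is held.
      body
    }
  }

  // A remove method would call this while holding the lock and keep the
  // BlockInfo in blockInfo if it returns true.
  def hasWaiters: Boolean = waiters.get() > 0
}
```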

I'm leaning towards option 1 because it's simpler. What do you think?

@kayousterhout (Member, Author)

Yeah 1 seems like a good approach here!

kayousterhout added a commit to kayousterhout/spark-1 that referenced this issue Jul 20, 2015
This commit changes the resultBlockId used by RddComputeMonotasks from
being an RDDBlockId to being a MonotaskResultBlockId. There's no reason for
this result to use an RDDBlockId (it's temporary data, not where the RDD
will be stored more permanently), and storing it with an RDDBlockId can
sometimes trigger a race condition in BlockManager between when the
monotask's result gets cleaned up and when a DiskWriteMonotask writes the
result (NetSys/spark-monotasks#26).
kayousterhout added a commit that referenced this issue Jul 22, 2015
This commit changes the resultBlockId used by RddComputeMonotasks from
being an RDDBlockId to being a MonotaskResultBlockId. There's no reason for
this result to use an RDDBlockId (it's temporary data, not where the RDD
will be stored more permanently), and storing it with an RDDBlockId can
sometimes trigger a race condition in BlockManager between when the
monotask's result gets cleaned up and when a DiskWriteMonotask writes the
result (#26).