BlockManager experiences a race condition in the following case:
Block A is being removed from memory by thread 1, and saved to disk by thread 2. The following sequence of events happens:
- Thread 1 calls removeBlockFromMemory, which acquires the lock on block A's BlockInfo.
- Before thread 1 actually removes block A, thread 2 calls updateBlockInfoOnWrite. updateBlockInfoOnWrite looks up the block's info in blockInfo, and the entry is still there.
- Thread 1 removes block A and releases its lock.
- Thread 2 acquires the lock on block A's BlockInfo. At this point, block A has been removed from blockInfo, but thread 2 has no way of knowing that, so it updates the BlockInfo for a block that no longer exists.
This later results in an exception when the thread that wrote block A to disk tries to access its BlockInfo.
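For concreteness, here is a minimal sketch of the unsafe check-then-lock pattern, using hypothetical, simplified names (a BlockInfo with a withLock helper, and a TrieMap standing in for the blockInfo map) rather than the actual BlockManager code:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical, simplified stand-in for the real BlockInfo class.
class BlockInfo(val id: String) {
  def withLock[T](body: => T): T = this.synchronized { body }
}

object RaceSketch {
  // Stand-in for BlockManager's blockId -> BlockInfo map.
  val blockInfo = new TrieMap[String, BlockInfo]()

  // Thread 2's path: look up the info, then lock it. The lookup and the lock
  // acquisition are two separate steps, so the entry can be removed in between.
  def updateBlockInfoOnWrite(blockId: String): Unit = {
    blockInfo.get(blockId).foreach { info => // step 1: the entry is still present
      info.withLock {
        // step 2: by the time the lock is held, blockId may already have been
        // removed from blockInfo, so any update here targets a block that no
        // longer exists.
      }
    }
  }

  // Thread 1's path: lock the info, remove the entry, then release the lock.
  def removeBlockFromMemory(blockId: String): Unit = {
    blockInfo.get(blockId).foreach { info =>
      info.withLock {
        blockInfo.remove(blockId)
      }
    }
  }
}
```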
This bug manifests when I run some of the new shuffle code in the InputOutputMetricsSuite.
@christophercanel is this something you can fix?
This is similar to the phenomenon documented in a comment in doGetLocal(). I can fix it, but there are a couple of options for how to go about it.
Option 1: Once a thread acquires a BlockInfo lock, it needs to double-check that the block is actually still there.
Option 2: Expand the locking functionality that the BlockInfo class provides to include a counter that tracks how many threads are waiting on its lock. The remove methods would then only remove a BlockInfo object if no other threads are waiting for its lock.
I'm leaning towards option 1 because it's simpler. What do you think?
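To make Option 1 concrete, here is a minimal sketch (continuing the hypothetical names from the sketch above, not the real BlockManager code): after acquiring the lock, re-check that blockInfo still maps blockId to the same BlockInfo instance, and skip the update if it doesn't.

```scala
// Sketch of Option 1: double-check after acquiring the lock.
def updateBlockInfoOnWriteChecked(blockId: String): Unit = {
  blockInfo.get(blockId).foreach { info =>
    info.withLock {
      if (blockInfo.get(blockId).contains(info)) {
        // Still registered under blockId: safe to update the BlockInfo here.
      } else {
        // The block was removed while we waited for the lock; skip the update.
      }
    }
  }
}
```

The re-check costs one extra map lookup and keeps all the complexity in the writer path, rather than adding waiter-tracking to BlockInfo as in Option 2.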
This commit changes the resultBlockId used by RddComputeMonotasks from
being an RDDBlockId to being a MonotaskResultBlockId. There's no reason for
this result to use an RDDBlockId (it is temporary data, not where the RDD
will be stored more permanently), and storing it with an RDDBlockId can
sometimes trigger a race condition in BlockManager between when the
monotask's result gets cleaned up and when a DiskWriteMonotask writes the
result (NetSys/spark-monotasks#26).