Race condition causing data inconsistency when nodes are coming up #148

indrekj · 2020-07-28T08:48:26Z

I think I found a race condition that is causing invalid data.

Version: latest master (7893228).

Some background: We use a slightly modified version of phoenix_pubsub with some performance optimizations (one of them is up as a PR here as well, and we also added a tag lookup ETS table to speed up delta merges). We encountered a shard crash that seemed to be caused by our modifications. After I managed to write a failing test, I noticed that the bug also exists on the original branch, although there it causes data inconsistency rather than a shard crash.

For us, this seems to happen when there's a network partition or Kubernetes thinks it's a good idea to move/add some pods around.

This is really hard to replicate in the real world. It usually happens for us maybe once a month.

Scenario (same as in the test, but in words):

1. Node A and Node B are connected.
2. Alice joins Node A (meta = initial).
3. Node A syncs with Node B.
4. Alice is changed (meta = update1). This update has been sent out but has not reached Node B yet.
5. Node C joins and connects with A and B.
6. Alice is changed again (meta = update2, "second" in the test).
7. Node C sends out transfer reqs (no response yet).
8. Node C receives a delta update from A (with the latest Alice).
9. Node C receives a transfer ack from Node B (which does not have the latest Alice).
10. Problem: the latest Alice is overwritten by the old Alice from Node B.
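
The ordering above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyReplica` and its `merge/2` are hypothetical and are not `Phoenix.Tracker.State`. The toy ignores causal contexts entirely and assumes a merge that writes incoming values unconditionally, which is the behaviour the scenario suspects.

```elixir
defmodule ToyReplica do
  # Hypothetical stand-in for a replica's "values": key => {meta, version}.
  # This is NOT Phoenix.Tracker.State; it only models the arrival order.

  # Naive merge: every incoming entry replaces whatever is stored locally.
  def merge(local, incoming), do: Map.merge(local, incoming)
end

# Node A has seen every update; Node B still holds the initial value.
a_delta = %{alice: {"update2", 3}}
b_transfer = %{alice: {"initial", 1}}

# Node C starts empty.
c = %{}

# The delta from Node A arrives first, so C briefly holds the latest Alice.
c_after_delta = ToyReplica.merge(c, a_delta)

# The late transfer ack from Node B arrives second and wins.
c_after_transfer = ToyReplica.merge(c_after_delta, b_transfer)

IO.inspect(c_after_delta, label: "after delta from A")
IO.inspect(c_after_transfer, label: "after transfer from B")
```

In this toy, Node C ends up holding `{"initial", 1}` even though it had already seen `{"update2", 3}`, simply because the stale message arrived last.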

Failing test:

test "delta before transfer from a different node", config do
  a = new(:a, config)
  b = new(:b, config)
  {a, _, _} = State.replica_up(a, b.replica)
  {b, _, _} = State.replica_up(b, a.replica)

  alice = new_pid()

  # Alice joins Node A
  a = State.join(a, alice, "lobby", :alice, "initial")

  # Node A sends updates to node B
  assert {b, [{{_, _, :alice}, _, _}], _} = State.merge(b, State.extract(a, b.replica, b.context))
  assert [:alice] = b |> State.online_list() |> keys()
  a = State.reset_delta(a)

  # Alice is updated for the first time
  a = State.leave(a, alice, "lobby", :alice)
  a = State.join(a, alice, "lobby", :alice, "update1")

  # update1 is not received by Node B (because of network delay or network
  # partition) or is received a lot later
  a = State.reset_delta(a)

  # Node C comes up
  c = new(:c, config)
  {b, _, _} = State.replica_up(b, c.replica)
  {a, _, _} = State.replica_up(a, c.replica)
  {c, _, _} = State.replica_up(c, a.replica)
  {c, _, _} = State.replica_up(c, b.replica)

  # Alice is updated a second time
  a = State.leave(a, alice, "lobby", :alice)
  a = State.join(a, alice, "lobby", :alice, "second")

  # Let's assume Node C has also sent a transfer_req to Node B here, but Node C
  # receives the delta heartbeat from Node A first.
  assert {c, [{{_, _, :alice}, "second", _}], []} = State.merge(c, a.delta)

  # Here everything is fine. Node C sees the latest alice.
  assert [
    {{"lobby", _, :alice}, "second", _}
  ] = c |> State.online_list()

  # Now Node C receives transfer ack from B (who has alice with one missed update)
  assert {c, _, _} = State.merge(c, State.extract(b, c.replica, c.context))
  assert [
    {{"lobby", _, :alice}, "second", {{:a, 1}, 2}}
  ] = c |> State.online_list()
  # ^ This fails because the most recent alice is overwritten with the old
  # alice (whose meta is now "initial")

  # Let's say we ignore the previous inconsistency and wait for the transfer
  # ack from Node A as well
  assert {c, _, _} = State.merge(c, State.extract(a, c.replica, c.context))
  assert [
    {{"lobby", _, :alice}, "second", _}
  ] = c |> State.online_list()
  # ^ This still fails - now there is no alice online at all
end

Also link: salemove@fdfe57c

Note: As this is quite complex to replicate in the real world, I cannot be 100% sure that my test reflects exactly what is happening. I'm fairly certain that entries in "values" are being overwritten, because I changed this line to use true = :ets.insert_new and it raised an error when new pods were coming up (it took 2 weeks to catch that, though).
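
For readers unfamiliar with the `true = :ets.insert_new` trick, here is a minimal standalone sketch of why it surfaces overwrites. The table name `:values_demo` and the key shape are made up for illustration and are not the tracker's actual table layout; only the `:ets` calls themselves are real.

```elixir
# A fresh, unnamed ETS set standing in for a "values" table (hypothetical).
table = :ets.new(:values_demo, [:set])

# Illustrative key only; the real tracker uses its own key shape.
key = {"lobby", :alice}

# First write: the key is absent, so insert_new/2 returns true.
true = :ets.insert_new(table, {key, "update2"})

# Second write with the same key: insert_new/2 returns false and leaves the
# stored object untouched. Writing `true = :ets.insert_new(...)` here would
# therefore raise a MatchError instead of silently overwriting.
result = :ets.insert_new(table, {key, "initial"})
stored_before = :ets.lookup(table, key)

# By contrast, :ets.insert/2 on a set replaces the existing object silently.
true = :ets.insert(table, {key, "initial"})
stored_after = :ets.lookup(table, key)

IO.inspect({result, stored_before, stored_after})
```

So a crash on `true = :ets.insert_new(...)` shows that a write hit a key that was already present, which is consistent with the overwrite described above but does not by itself prove which message caused it.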

If my assumptions and the test case are correct, I still don't have a good idea how to fix it...


jaybe78 · 2025-01-23T21:12:47Z

Did you find out whether the issue was related to the change you made in the Tracker?
Cheers
