Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An workaround for conflict resolve failure #1087

Open
wants to merge 2 commits into
base: release/hydrogen
Choose a base branch
from

Conversation

workingenius
Copy link

@workingenius workingenius commented Nov 25, 2020

Issue here: https://github.com/couchbase/couchbase-lite-core/issues/1084

As we don't know the real cause of the problem, we came up with a workaround.

Every time a rev tree is saved, check if conflict tags are in consistent state. Fix it if not.

It may not be the best solution, please check. Hope it helps.

@borrrden
Copy link
Member

Unfortunately this breaks some existing functionality (see the PR validation result). This area is very hard to touch without precise knowledge of what is going on. However you noted that you are unable to reproduce the issue, so how are you testing this fix?

@workingenius
Copy link
Author

Unfortunately this breaks some existing functionality (see the PR validation result). This area is very hard to touch without precise knowledge of what is going on. However you noted that you are unable to reproduce the issue, so how are you testing this fix?

Yeah, you are right. We don't have very precise knowledge of the details and the code, so we would like some help from the community and finally have it fixed totally.

Our situation is simple, and we are just using limited features supplied by couchbase & lite. So we don't need to consider every detail like you do. We are going to run the patched version on some devices for a relatively long time. We hope it will not ruin the "good data" 😄 , and then, if there are less conflict problems, that make sense to us.

@workingenius
Copy link
Author

@borrrden Could we talk about this? I don't understand the test case that failed.

core/Replicator/tests/ReplicatorLoopbackTest.cc, line 1257.

    // For https://github.com/couchbase/sync_gateway/issues/3359
    C4Slice docID = C4STR("Khan");

    {
        TransactionHelper t(db);
        createRev(db,  docID, C4STR("1-11111111"), kFleeceBody);
        createConflictingRev(db, docID, C4STR("1-11111111"), C4STR("2-22222222"));
        createConflictingRev(db, docID, C4STR("1-11111111"), C4STR("2-ffffffff"));
        createConflictingRev(db, docID, C4STR("2-22222222"), C4STR("3-33333333"));
    }
    _expectedDocumentCount = 1;
    runPullReplication();

    c4::ref<C4Document> doc = c4doc_get(db2, docID, true, nullptr);
    REQUIRE(doc);
	C4Slice revID = C4STR("3-33333333");
    CHECK(doc->selectedRev.revID == revID);
    CHECK((doc->flags & kDocConflicted) == 0);  // locally in db there is no conflict

After the transaction, the rev tree should look like this (and the loopback replication should not change the rev tree except remote tags):

1-11111
|
|----------------------|
|                      |
2-22222                2-fffffff
|
|
|
3-33333 [CURRENT]

It diverged at the second generation, and 2-ffffffff is a leaf that is not on the current branch. So in my opinion it should be conflicted and it should have a kRevIsConflict flag. Why does the comment says "locally in db there is no conflict"?

And here is another confusion. If we call hasConflict method, it should return false. As there are more than one "Active" revs (3-333 and 2-fff), but the document carries a non-conflict flag. I know the flags of the document are updated from flags of the "selected rev" (here should be 3-333, the current) just before saving, what's the consideration here? I guess there's some of your thinking that I didn't get.

Thank you

@borrrden
Copy link
Member

Your analysis is actually very on point, I didn't think most outside people would read so far into it. However there is one small gotcha that you missed. If I were to pull the document out of db, what you said is 100% correct and I checked that is the case. However the line you noted is on a document that came out of db2 (i.e. the other side of the replication), not db.

The conflict was not inserted into db2, and replication does not consider all leafs of a conflicted document (more accurately the platform above LiteCore strives not to have these conflicted leafs in the database in the first place). You can still manually create them here since this is a core test, but this is actually trying to simulate the behavior of another situation (you can find a long discussion of the scenario in the issue that is linked in the first line of the test):

  1. db replicates 1-2-3 tree to db2, so the current revision in db2 is 3-3333 (no conflicts in db2, as you noticed in the test)
  2. db deletes 3-3333, making 2-ffff the new winning revision
  3. db replicates this new winner to db2
  4. db2 recognizes that the last revision it received before 2-ffff was 3-3333, and that it locally has 3-3333 still. This means that the other side has for some reason switched to "another branch" (in this case because it deleted the previous winning branch) and that it should follow suit by performing similar actions.

This is a very delicate situation that actually I was not involved in writing the solution for, so there are a lot of checks that happen here to ensure that this is really what happened (if you are curious you can see them inside of TreeDocument::putExistingRevision, noting the part that has the comments about switched branches).

@workingenius
Copy link
Author

By the way, where should I commit new issues if it's about android but not core?
I asked some questions here. couchbase/couchbase-lite-java-ce-root#12.
And, we found more problems on replication.
Thank you.

@borrrden
Copy link
Member

borrrden commented Dec 3, 2020

If you mean file issues, the android repo is here. The repo you mentioned is for Java desktop.

However, if you want to ask questions please use the forums

@workingenius
Copy link
Author

If you mean file issues, the android repo is here. The repo you mentioned is for Java desktop.

However, if you want to ask questions please use the forums

But the repo you gave is deprecated, and redirected to this.

@borrrden
Copy link
Member

borrrden commented Dec 3, 2020

Oh! I forgot about that and you caught me. ce-root is the correct repo for filling issues!

@workingenius
Copy link
Author

As I keep debugging, I wonder why a C4DocPutRequest has a "allowConflict" field?
That is to say, why do we need to allow conflict revs to exist (but not tagged)?
I think conflicts should always be tagged.

@borrrden
Copy link
Member

borrrden commented Dec 9, 2020

It has nothing to do with tagging or not tagging. That flag is simply whether or not to outright refuse to insert (false) or not (true). The former can be used when a local insert is attempted because we can be guaranteed that the caller knows both revisions (the one they tried to insert, and the one they can retrieve from the DB). The latter is for replicated conflicts, since the "caller" is basically a network message in the background. In order to properly resolve, we save the conflict temporarily, and then later in the resolution phase grab both versions and resolve them.

@workingenius
Copy link
Author

workingenius commented Dec 10, 2020

It has nothing to do with tagging or not tagging. That flag is simply whether or not to outright refuse to insert (false) or not (true). The former can be used when a local insert is attempted because we can be guaranteed that the caller knows both revisions (the one they tried to insert, and the one they can retrieve from the DB). The latter is for replicated conflicts, since the "caller" is basically a network message in the background. In order to properly resolve, we save the conflict temporarily, and then later in the resolution phase grab both versions and resolve them.

Yeah, you are right, I made a mistake.
What I care is markConflict argument of RevTree::insertHistory method, but not allowConflict.
if markConflict is false, an inserted revision that cause a conflict will not be tagged as conflict.
Why do we need this argument?

When replication goes, the conflict tag is important. We need it to tell two things:

  1. If a conflict happens or not (If conflict resolve will be triggered).
  2. Which rev is the main cause of the conflict (and it should be put together with CURRENT rev to resolve). In java codes, it traverses all leaves and pick the first revision tagged conflict to be the main cause.

So if the conflict tags break down, conflict resolve will not work, and it finally put our data into an inconsistent state.

From my perspective, to make conflict resolve work, a leaf revision should be either CURRENT or CONFLICT. But there's an argument markConflict to break the rule. Are there any situation that we must allow multi non-conflict leaves?
In fact that's what failed the auto test case, an "inconsistent" tree is constructed in the case. My code tried to "fix" it but breaks the following CHECKs.

@borrrden
Copy link
Member

I actually am unsure about that part of the code. I don't often get the chance to look at it. I wonder if @snej could answer.

@snej
Copy link
Collaborator

snej commented Mar 17, 2021

Hey @workingenius, I'm impressed that you've dug so deeply into this code, some of which (RevTree) is very old and has a lot of history :) and technical debt associated with it. Let me try to explain. Conveniently, writing this will also help me remember some details that I haven't thought about in a while...

Your confusion about markConflict/isConflict has to do with design changes we made late in the 2.0 dev cycle that made conflict handling work differently than in 1.x or in CouchDB. Previously, the pull side of replication copied the entire remote rev-tree of a document; the goal is that every replica of a database has the same exact rev-tree. In CBL 2.x we don't do that. Instead we only pull the active/current branch from SG. CBL also only allows one local branch -- the public API (CBL, not LiteCore) doesn't allow you to create local/local conflicts. So a conflict always exists only between the one local branch and the one remote branch. (There can be multiple remote branches if you replicate with multiple servers, but let's ignore that.)

Rather than rewrite RevTree according to the new way, we made smaller changes to support what we needed. (The rewrite is happening in 3.0; see the new version-vector code on the dev branch. This mode abandons rev-trees entirely.)

The failing test case "Server Conflict Branch-Switch" is for handling a condition that can happen on Sync Gateway. On SG, when a doc is in conflict between leaf revisions A and B, the conflict can be resolved by adding a child to either A or B, making that new rev current, and adding a tombstone to the other. That means that at one point the doc's current revision can be A, and then later B' (the child of B), where B' is not a descendent of A. When the CBL replicator pulls B' it needs to detect this and edit the rev-tree so that the conflicting branch now has leaf B', not A, because that's the current server revision that the client needs to resolve with.

(to be continued)

@snej
Copy link
Collaborator

snej commented Mar 17, 2021

In that unit test, here's the situation after the initial replication. I'm using green for the current revision, red for conflicting ones. Note that db2 only has the branch with db's current revision, not the conflicting one:
image

In SECTION("Unmodified"), db deletes the main branch, so 2-ffffffff becomes the current revision. When db2 pulls, it updates its rev tree to reflect this. The result is:
image

in SECTION("Modify before 2nd pull"), db2 locally adds rev 4-4444 atop 2-ffffffff before pulling, thus creating a conflict:
image

When it resolves the conflict with 4-4444 winning, rev 2-ffffffff gets the kClosed flag, and the result is:
image

Working through this made me realize the test isn't resolving this the way CBL would. It turns out we have to preserve the server's current rev (2-ffffffff) as an ancestor of the local rev, otherwise the server will refuse to let us push the local rev (it won't let us create a server-side conflict.) That means that 2-ffffffff has to win the conflict. To preserve the local state, we make the merged rev body be the same as the local rev (4-4444). But since it has a different parent, it gets a different revID, say 3-34343434. So we end up with:
image

@snej
Copy link
Collaborator

snej commented Mar 17, 2021

Anyway! So after all this explanation and diagrams, I'm looking at the issue #1084 you reported. I agree that it's wrong for the [REMOTE#1] revision not to have the [C]onflict flag set. I'll try and figure out how this could happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants