Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue#18: internal_co: Always use the internal logical lines of RCSStream instead of the flat plain text to reproduce the next revision. #22

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

futatuki
Copy link

As the return value of RCSStream.invert_diff() is not to be applied for flat text content but for internal logical lines in RCSStream, which may contain some unterminated logical lines at any position, so we should use the content of internal logical lines in RCSStream as base text which is used to get the content of the next newer revision.

With this patch, it is implemented by splitting the checkout() method in TextRecord, checkout() method for external use for as in the past and checkout_as_lines() method for internal use.

See also the comments in PR #19, which will be closed as this PR is a replacement of it.

* run-test.py
  (internal_co_broken_rcsfile): New test.
  (test_list): Add it.
…t_db.

As the return value of RCSStream.invert_diff() is not to be applied
for flat text content but for internal logical lines in RCSStream,
which may contain some unterminated logical lines at any position,
so we should use the content of internal logical lines in RCSStream
as base text which is used to get the content of the next newer revision.
With this commit, it is implemented by splitting the checkout() method
in TextRecord, checkout() method for external use for as in the past
and checkout_as_lines() method for internal use.

* cvs2svn_lib/checkout_internal.py
  (TextRecord.checout_as_lines): New method. Replacement of checkout()
    method but returns internal lines in RCSStream instead of a plain
    text.
  (TextRecord.checkout):
    Use checkout_as_lines() for default implementation.
  (FullTextRecord.checkout, DeltaTextRecord.checkout):
    Removed to use default implementation.
  (FullTextRecord.checkout_as_lines):
    New method. Just same logic as the past checkout() method.
  (DeltaTextRecord.checkout_as_lines):
    New method. Just same logic as the past checkout() method but
    uses internal lines in rcs_stream instead of its text.
  (_Sink.set_revision_info):
    Record the internal lines in rcs_stream instead of its text
    at revision 1.1.

* cvs2svn_lib/rcs_stream.py
  (RCSStream.__init__):
    Allow to set lines directly in addition to a text.
  (RCSStream.get_lines):
    New method.
Copy link

@ossilator ossilator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a clean solution.

you're now serializing revision contents as lines rather than text (which, btw, means that you need to update some classes' comment). i'd expect this to have a somewhat detrimental effect on database size, and probably also on runtime (more i/o, and serialization with higher granularity, partially offset by saving the repeated splits and joins). did you benchmark this variant? the commit message should document the result.

more fundamentally, i wonder whether this way of dealing with the problem is correct.
did you verify that rcs produces the same output, incl. on intermediate revision, after repeated corruptions of this kind?
the alternative would be dealing with the problem right at the source: if an add not at the end of the text doesn't end with a newline, just add one.

def checkout(self, text_record_db):
"""Return the text for this revision.

Just same as checkout_as_lines() but returns text as flat text string."""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"just same as" is a questionable construct. "same as" or "just as".
you're making the same mistake in the commit message.

also, add comma before "but".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing it out. I corrected it and looked over comments in this file and make a new commit eb5aab7.

* test-data/gh-issue-18-irregular-rcsfile-cvsrepos/irregular.txt,v
 - Insert a broken delta in r1.5
 - Add a branch revision r1.2.1.1, branched from corrupted revision r1.2.
@futatuki
Copy link
Author

Thank you for the review.

you're now serializing revision contents as lines rather than text (which, btw, means that you need to update some classes' comment). i'd expect this to have a somewhat detrimental effect on database size, and probably also on runtime (more i/o, and serialization with higher granularity, partially offset by saving the repeated splits and joins). did you benchmark this variant? the commit message should document the result.

I think amount of the increase of the resouce consumption is limited by the number of branches, FullTextRecord for rev 1.1 in trunk and CheckedOutTextRecord for each branching point revision in other branches. Unfortunately I don't have suitable example of such a repo that have many branches compared with total revision numbers and each files in the repos are large enough. So I didn't any benchmark it.

more fundamentally, i wonder whether this way of dealing with the problem is correct.
did you verify that rcs produces the same output, incl. on intermediate revision, after repeated corruptions of this kind?

Yes, I compared with RCS output in some examples including test-data/gh-issue-18-irregular-rcsfile-cvsrepos/irregular.txt,v (this contains 2 corrupt delta and a branch revision from corrupted revision).

the alternative would be dealing with the problem right at the source: if an add not at the end of the text doesn't end with a newline, just add one.

The problem is that we don't know how it was corrupted and what it should be. So I tried to imitate how RCS/CVS deal it.

@futatuki
Copy link
Author

(this contains 2 corrupt delta and a branch revision from corrupted revision).

This was not true... It contains only 1 corrupt delta. One more corrupt delta was lost, perhaps by wrong git stash operation. (So the commit log of 5d1a8df is incorrect).

Previous commit did not contain new delta r1.5. This is a correction.
@futatuki
Copy link
Author

Then I've added the lost corrupted delta.

@futatuki
Copy link
Author

I think amount of the increase of the resouce consumption is limited by the number of branches, FullTextRecord for rev 1.1 in trunk and CheckedOutTextRecord for each branching point revision in other branches.

I'm sorry, I missunderstood. On intermediate revison nodes also use CheckedOutTextRecord after checked it out, because there still exist atleast one newer revision to refer it. So it does not need so special example to measure its effect. (although I still don't have a repos example which has suitable size...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants