Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better behavior with unchanged files #328

Closed
mattmccutchen-cci opened this issue Nov 5, 2020 · 13 comments · Fixed by #412
Closed

Better behavior with unchanged files #328

mattmccutchen-cci opened this issue Nov 5, 2020 · 13 comments · Fixed by #412
Assignees
Labels

Comments

@mattmccutchen-cci
Copy link
Member

mattmccutchen-cci commented Nov 5, 2020

Normally, 3c prints the rewritten version of a file foo.c to standard output, or if -output-postfix=checked is passed, the rewritten version is written to foo.checked.c. However, if the rewritten version is identical to the original, then:

  • Without -output-postfix, the output to standard output is empty. This edge case is bound to cause scripts to behave unexpectedly if they think the result of rewriting is actually an empty file. John pointed out that many of the regression tests take advantage of this behavior to assert that a file needs no rewriting by checking that the output is empty (look for count 0), but we think the tests could just as easily (and more clearly!) cmp to the original file.
  • With -output-postfix, the foo.checked.c file is not created. Hopefully scripts not designed for this case will just error out instead of doing something wrong. Some users might consider this useful and there might be a performance benefit in some workflows, but there is at least one other reason why a given file might not be written (-base-dir), so if we keep the behavior, we should tell the user why a given file wasn't written. Update 2021-02-13: We've deprioritized that part because it's now an error if 3C generates a change to a file outside the base dir (Properly constrain non-canWrite files to not change (remaining cases) #387), so we're not aware of any other common reason for not writing a file that would need to be distinguished from the file being unchanged; if one arises, we'll try to make sure that that one generates a distinctive message. And the code that writes the files only has a list of modified rewrite buffers; I'm not aware of an easy way to get a list of files that were read but not modified.

This issue is to decide whether to stop treating unchanged files specially in one or both cases (or maybe make it an option), and if we keep the special treatment with -output-postfix, provide better diagnostics.

Note that even if we write unchanged files passed on the command line (by default or as an option), we might not want to write unchanged files included from the -base-dir, at least without an additional option.

@mattmccutchen-cci
Copy link
Member Author

Today I learned that there's at least one case in which 3c prints the rewritten version even when it is identical to the original: see this RUN command in the definedType test.

@mattmccutchen-cci
Copy link
Member Author

Today I learned that there's at least one case in which 3c prints the rewritten version even when it is identical to the original: see this RUN command in the definedType test.

I asked in today's meeting if anyone had more information about this. Mike said the inconsistent behavior should be considered a bug. But if we decide to always output unchanged files, we wouldn't have to do the work to investigate where the current inconsistency is coming from. Should we do that or is it useful to not output unchanged files? (In terms of workflow, do users like to use -output-postfix and run git status and look for untracked files, or would they be just as happy diffing against the output dir once #347 is implemented or using a tool that moves all the written files into place and runs a git diff?) @mwhicks1 suggested I file a new issue, but I think it makes sense to just discuss the topic in this one.

@mwhicks1
Copy link
Member

I would be OK with outputting the new file every time. We could also output a diagnostic under a "verbose" flag that says the content of the file was not changed. This diagnostic can be ignored most of the time (i.e., the flag is not on by default).

@mattmccutchen-cci
Copy link
Member Author

Thanks. Per the last sentence that I recently added to the initial comment: are you OK with outputting all header files in the base dir (and thus allowed to be written by 3C) that are transitively included from the files on the command line, even if there are no changes to make to them? That could be a huge number of unchanged files. But if the user has a script to diff or whatever, a larger number of files may not be a problem.

@mwhicks1
Copy link
Member

Hm, good point.

After a little more thought, I feel that this is not an issue worth pursuing. We are spending a lot of time on what seems to be to be a minor problem. Whether the file is output or not when there is no change, you have to deal with it in post-processing. To me, to avoid clutter to the file system, I lean to the current behavior.

Let's just implement the "output directory" feature and aim to only output files that changed, as we have now. If occasionally one is output that should not be, we can decide whether it rises to the level of a real problem and fix it then.

@mattmccutchen-cci
Copy link
Member Author

Sounds reasonable for the modes that create files (-output-postfix and the forthcoming -output-dir). If our attitude is that for a file that needs no change, either creating it or not is a valid behavior, then I'll make the new regression test script (#355) accept both outcomes for all tests rather than hard-coding which tests trigger one or the other based on the current quirks of 3C.

For the mode that outputs to stdout, I still think it's bad that a script can suddenly start to silently exhibit incorrect behavior if the file is unchanged and the script assumes it is supposed to be empty. I might even be in favor of deprecating the mode that outputs to stdout altogether, since nearly all real users will have multi-file programs (which this mode can't handle in general) and the regression tests will no longer use this mode. But if you think this is low priority, let me know and we can mark the issue as such.

@mwhicks1
Copy link
Member

For the output mode to stdout, I can be persuaded that we should always output the result. It's surprising, when running this mode, and nothing is emitted, I'll agree.

@mattmccutchen-cci
Copy link
Member Author

mattmccutchen-cci commented Jan 19, 2021

Updates: At Thursday's meeting, some team members were emphatic that:

  1. We should keep stdout mode for quick tests even if it isn't very general, and the implementation should remain in the tool rather than a wrapper script: it will be more convenient for users, and we suspect a second code path in the tool may actually be easier to maintain than a wrapper script.
  2. Stdout mode should have an option to solve for changes to multiple files but output only the new version of the main file, discarding the changes to the other files. I'm not convinced that this is a sane use case, but I'll defer to the team. The team allowed that we could require a flag to opt in to this arguably surprising behavior.

Outstanding design questions (stretching the scope of this issue a bit, but I think it's a lesser evil than filing a separate one):

  • What to do in stdout mode if the "solve for and discard changes to other files" flag isn't passed? Solve only for changes to the main file by default, or refuse to run unless the user passes a flag to choose one of the two behaviors? One possible design is to reuse the -base-dir flag (which controls what files we solve for) to distinguish the two behaviors. Currently, -base-dir always defaults to the working directory. We could change things so that in stdout mode, the default is to solve only for the main file, but if -base-dir is set (whether to the working directory or something else), we solve for all files under that directory and error if the main file isn't under the directory, then discard the changes to non-main files.

@mwhicks1
Copy link
Member

I think we care less about (2) than about (1). To me it's fine to just reject rewriting a program in stdout mode if included headers would also have to be changed. I think that addresses your issue?

@mattmccutchen-cci
Copy link
Member Author

To me it's fine to just reject rewriting a program in stdout mode if included headers would also have to be changed. I think that addresses your issue?

Included headers never "have" to be changed: we can always constrain them to not change. (In the worst case, this will just cause the new version of the main file to be identical to the original.) To review, I think we're considering three modes here:

  1. Solve only for changes to the main file, i.e., constrain all other files not to change. This seems like the most straightforward behavior to understand.
  2. Solve for changes to all files under the -base-dir, but only output the new version of the main file and discard any changes to other files. I'm willing to implement this mode if other team members are convinced we should (as they seemed to be on Thursday).
  3. How I interpret your last comment: Solve for changes to all files under the -base-dir but then error if we tried to change a file other than the main file. I don't see the point of this mode compared to mode 1.

@mwhicks1
Copy link
Member

For stdout mode, it should fail if you give it more than one file on the command line.

When you give it one file, the only question is files it #includes. Don't prevent these from changing through constraints. Rather, if you get to the end and it turns out that several files should change per the constraint solution (meaning they should be rewritten, normally) then fail. This is not the same as forcing included files not to rewrite but proceeding anyway, because you will get a different answer for your main file, potentially.

@mattmccutchen-cci
Copy link
Member Author

We discussed the sub-modes of stdout mode in today's meeting: We don't want mode 2 after all, but we do want mode 3 for the case where the user wants to solve for all files under the -base-dir and believes that only one of them will change and wants to assert that that is indeed the case. Sounds reasonable to me; I'll implement it.

Remaining questions: Do we want to support mode 1? (I'm not clear if we previously made a decision about this or, if so, whether the decision stands.) If so, what flag(s) control the choice of mode, and what is the default if none of those flags are passed? In mode 3, is the default -base-dir still the working directory (as it is in non-stdout modes)?

@mattmccutchen-cci
Copy link
Member Author

Update from today's meeting: Stdout mode will support only mode 3 and the default -base-dir will still be the working directory. That makes things simple. In all output modes, specifying a file on the command line that is outside the base dir will be an error, but the user may not have passed the -base-dir option and may not know what the base dir is, so we'll need the error message in that case to be good; we hope that case will be rare.

mattmccutchen-cci added a commit that referenced this issue Jan 29, 2021
…nge. (#391)

This PR addresses function and variable declarations (because they are
the most obvious case and reasonably straightforward) and checked
regions (because they came up in some existing regression tests). We'll
leave #387 open for the tail of unhandled cases.

Also:

- When 3C tries to change a non-writable file, issue an error diagnostic
  (not an assertion failure because there are known unhandled cases)
  rather than silently discarding the change.

- Add a `-dump-unwritable-changes` flag to the `3c` tool to dump the new
  version of the file when that diagnostic appears.

- Add an error diagnostic when 3C tries to change a file under the base
  dir other than the main file in stdout mode. This is a separate
  feature (part of #328) but ended up being easy to implement along with
  the diagnostic for a non-writable file.

- Add tests for all the fixes (but not `-dump-unwritable-changes`).

- Fix a bug where `-warn-all-root-cause` didn't work without
  `-warn-root-cause`, because this affected one of the new tests. The
  use of `-warn-all-root-cause` without `-warn-root-cause` in the
  affected test will serve as a regression test for this fix as well.

Fixes part of #387 and a few unrelated minor issues.
@mattmccutchen-cci mattmccutchen-cci self-assigned this Feb 2, 2021
mattmccutchen-cci added a commit that referenced this issue Feb 13, 2021
- Add -output-dir option to write updated files to a directory structure
  parallel to the base dir (#347). When -output-dir is used, a source
  file outside the base dir can't be handled because there is no way to
  compute its output path. For consistency, this is now an error even
  when -output-dir is not used.

- Convert all 3C regression tests from -output-postfix to -output-dir to
  avoid leaving temporary files in the clang/test/3C directory (#378).

- Expand "3c -help" documentation. In particular, direct the user to pass
  "--" when they don't want to use a compilation database to avoid
  accidentally using unwanted compiler options and suppress the warning
  if no compilation database is found (#343).

- For consistency, have stdout mode output the main file even if it is
  unchanged (#328).

- Fix bugs in matching of file paths against the base dir (#327).

- Other minor bug fixes: see the pull request description for details.

Co-authored-by: John Kastner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants