Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3C loses struct definition as part of a field nested in another struct definition #531

Closed
mattmccutchen-cci opened this issue Apr 7, 2021 · 11 comments · Fixed by #657
Closed
Assignees
Labels
benchmark failure A bug causing a failure in our nightly benchmark tests bug Something isn't working

Comments

@mattmccutchen-cci
Copy link
Member

Example reduced from libarchive:

struct outer {
  struct inner {
    int x;
  } i_arr[1];
};

int foo(struct outer *o) {
  return o->i_arr[0].x;
}

3c -alltypes produces:

struct outer {
  struct inner i_arr _Checked[1];
};

int foo(_Ptr<struct outer> o) {
  return o->i_arr[0].x;
}

We lost the definition of struct inner, so we get the following compile error:

nested-struct.checked.c:2:30: error: array has incomplete element type 'struct inner'
  struct inner i_arr _Checked[1];
                             ^

I believe this also explains at least some of the "incomplete definition of type" errors in libarchive, though I haven't confirmed that. I haven't looked into whether there are other variants of the example that trigger the bug without using an array.

It looks like 3C has code somewhere that splits struct inner { ... } i_arr[1]; into struct inner { ... }; and struct inner i_arr[1]; when it occurs at the top level, but this doesn't work for nested structs.

@mattmccutchen-cci mattmccutchen-cci added bug Something isn't working benchmark failure A bug causing a failure in our nightly benchmark tests labels Apr 7, 2021
@john-h-kastner
Copy link
Collaborator

I haven't looked into whether there are other variants of the example that trigger the bug without using an array.

Yep

struct outer {
  struct inner {
    int x;
  } *i;
};

becomes

struct outer {
  _Ptr<struct inner> i;
};

@mattmccutchen-cci mattmccutchen-cci changed the title 3c -alltypes loses struct definition as part of an array field nested in another struct definition 3C loses struct definition as part of a field nested in another struct definition Apr 7, 2021
@mattmccutchen-cci
Copy link
Member Author

I updated the issue title to reflect that there are examples that do not require array fields and thus do not require -alltypes.

@kyleheadley kyleheadley self-assigned this Apr 7, 2021
@kyleheadley
Copy link
Member

This problem may be worse that we thought:

struct outer {
  struct inner {int x;} *i;
} a[4];

converts to

struct outer {
  _Ptr<struct inne};
struct outer a _Checked[4];

which isn't even correct syntax, or use of the name "inner".

@mattmccutchen-cci
Copy link
Member Author

mattmccutchen-cci commented Apr 9, 2021

It looks like several other error message templates that currently occur in our benchmarks may be caused by this issue when the nested struct type has no name (not to be confused with an "anonymous" struct where the variable has no name and the fields enter the surrounding scope). With this slight modification of the original example:

struct outer {
  struct inner {
    int x;
  } i_arr[1];
};

int foo(struct outer *o) {
  return o->i_arr[0].x;
}

3c -alltypes gives:

struct outer {
  // Now it looks like `i_arr` is the type name rather than the field name, so we get:
  // error: expected member name or ';' after declaration specifiers
  // error: array has incomplete element type 'struct i_arr'
  struct i_arr _Checked[1];
};

int foo(_Ptr<struct outer> o) {
  // error: no member named 'i_arr' in 'struct outer'
  return o->i_arr[0].x;
}

@mattmccutchen-cci
Copy link
Member Author

mattmccutchen-cci commented Apr 9, 2021

Summary of discussion from today's meeting:

I suggested that one approach we could take that should completely avoid this problem is to move all struct definitions to the top level and leave the field declarations just referring to them by name. Kyle pointed out that 3C tries to minimize its edits to the user's code and this would be a dramatic edit, but Mike says that nested struct definitions could be considered a poor coding practice that 3C is justified in preemptively cleaning up.

As Kyle pointed out, the next implementation problem is that the Clang API makes it straightforward for us to get the proper source locations to rewrite the inner struct fields in place or to move the entire inner struct definition as a chunk of text, but it would be tricky to do both in the same rewriting pass. I proposed a workaround of performing an initial pass that just moves all struct definitions to the top level and then re-parsing all the source files so we have the correct source locations to rewrite the fields of the moved struct definitions. This could all be done within a single run of the 3C executable. (If we wanted, 3C could avoid moving structs that it won't need to rewrite by solving for the annotations, figuring out which structs it will need to rewrite, moving them, and then starting over.)

Further ideas from me after the meeting:

We could consider additionally addressing #542 by inventing names for unnamed struct definitions and moving them in the same way.

But I wonder if we could just fix the rewriter to bring the struct body along with the type name wherever it needs to go in the rewritten code for the first variable declaration. For a multi-variable declaration, the subsequent variables can reference the struct type by name. For example, 3C could convert this:

struct mystruct {
  int *x;
} (*f)(struct { int *y; } *p), b;

to this:

_Ptr<struct mystruct {
  _Ptr<int> x;
} (struct { _Ptr<int> y; } *p)> f;
struct mystruct b;

Yes, it's valid Checked C to define a named struct type inside a function pointer type like this! It looks crazy, but the user is the one who decided to combine the struct and variable definitions in the first place. 🙂 This could work the same way whether the code is at the top level or nested in an outer struct. The same approach would work for single-variable declarations using unnamed struct types (in effect, automating in 3C what the 3C warning in #542 currently tells the user to do manually) and even for multi-variable declarations if we invent a name for the struct type.

Update: I realize that the "keep the body with the name" approach has a similar problem as the "move to the top level" approach in that we may have to rewrite the fields of the struct as well as the surrounding declaration, and it's unclear whether it would be feasible to break this into multiple passes as I proposed for the "move to the top level" approach. In simpler examples like the one above, we might be able to fiddle with the syntax surrounding the struct definitions while preserving the source locations of the struct fields so we can rewrite them in parallel. (In this case, we would add the _Ptr< before struct mystruct, delete the (*f), and add the > f after *p).) But this breaks down in extreme cases with multi-level function pointers, in which we may need to reverse the order of the inner and outer parameter lists. For example, this:

struct mystruct2 {
  int *x;
} (*(*g)(struct { double *z; } *q))(struct { int *y; } *p);

might have to become this:

_Ptr<_Ptr<struct mystruct2 {
  _Ptr<int> x;
} (struct { _Ptr<int> y; } *p)> (struct { _Ptr<double> z; } *q)> g;

Note that the order of y and z was reversed and we also rewrote both of their types. Realistically, if we want to produce valid output for these extreme inputs, we probably need to implement the ability to move struct definitions to the top level, and then it may not be worth the extra work to implement the "keep the body with the name" approach as an alternative for the simpler inputs.

@mattmccutchen-cci
Copy link
Member Author

The latest plan is that I'm going to work on de-nesting structs, not Kyle.

@kyleheadley
Copy link
Member

It's been a while, but I'm just seeing the above update now, so I just wanted to chime in to say that depending on how much work it is, we could "keep the body with the name" and ignore the complex fp example as uncommon code. I prefer smaller diffs and we don't need to support every possibility (though I would like to if we had the time).

But I don't know which solution you favor or how much work they will be.

@mattmccutchen-cci
Copy link
Member Author

Our current plan after discussion at today's meeting is essentially as I previously suggested: Move all structs to the top level (or if they are inside a function, maybe to the top level of the function instead) as a first pass, then if any structs actually needed to be moved, re-parse the ASTs before proceeding with the normal inference. Implement struct de-nesting first and then decide whether to change ordinary multi-decl rewriting to use it (see #647). 3C would accept a flag to stop after struct de-nesting (and any similar refactorings we implement in the future) and not do inference.

The re-parsing will have some performance cost. We hope that users who are committed to porting their program will be willing and able to accept the struct de-nesting refactoring up front (assuming we can make it reliable enough); then subsequent 3C runs will detect that no structs need to be de-nested and the performance cost would be minimal. However, users who want to make small changes to their original program and see the corresponding changes to the 3C output would have to pay the performance cost repeatedly.

An additional thought from me: If there are multiple levels of nested structs:

struct A {
  struct B {
    struct C { ... } c;
  } b;
};

The most straightforward approach would be to move only the second-level structs such as B (along with their descendants, if any) to the top level, then re-parse the ASTs and repeat until there are no nested structs left. If we wanted to do all levels of de-nesting in a single pass, we would have to figure out the correct text to insert at the top level for an intermediate struct such as B, with all further nested struct definitions removed: struct B { struct C c; };. This might be manageable if we cared enough; certainly it would be more manageable than performing arbitrary rewrites to field types in the same pass as de-nesting a struct.

@mattmccutchen-cci
Copy link
Member Author

I had a new idea about the general problem of moving a code block when rewrites also need to be made inside of it. This problem comes up both when de-nesting three or more levels of structs as in my previous comment (there are actually a few examples of this in libarchive) and if we want to de-nest structs and perform the normal inference in the same run of the 3c tool.

My original proposed approach of re-parsing the ASTs has a number of downsides. A new one I realized is that 3C might issue a diagnostic against an intermediate version of a file that differs from either version available to the user (the original or the final version in the output dir). This would be confusing, though I don't know how often it would happen in practice and how confusing it would actually be. And it would complicate the integration with the diagnostic verifier, for what that's worth. More broadly, if something goes wrong in producing an intermediate version of a file and that leads to later problems, it might be more work to troubleshoot; we'd need an option to dump out all intermediate file versions.

My new idea is to buffer all the requested rewrites and then have code that looks at the source ranges to find an order in which to perform the rewrites so that all rewrites inside a code block are done before the block is moved. When we move the block, we call Rewriter::getRewrittenText to get the text including the rewrites that were made inside. Then we can do the entire analysis based on the original ASTs, which would make it easier (for example) to decide whether to de-nest a struct based on whether a pointer type in the containing multi-decl actually needed to be changed, if we wanted. The downsides I see of this approach:

  1. If we can't find a valid order in which to perform the rewrites, we are stuck. I suppose we could fall back to the AST re-parsing approach when that happens, though that would mean maintaining both code paths. I think the ordering will be OK with 3C's current feature set, but I can't account for all changes that could conceivably be made to 3C in the future.
  2. Currently, rewriteSourceRange knows whether the rewrite succeeded; it doesn't return that information to the caller, but it could if we had a need for it. Under the new approach, rewriteSourceRange would only buffer the rewrite, and we wouldn't find out until later whether it succeeds. I hope it will remain the case that we don't need the information at the time of the rewriteSourceRange call, but I can't be sure.
  3. Supposedly, Rewriter::getRewrittenText can be slow on a large file because it iterates through the entire file one character at a time. If performance becomes a problem in 3C, it looks to me that this would be easy to fix if we could modify the code, but I don't know whether we would be able to either get a patch accepted to upstream Clang or get Microsoft to agree to a change to existing Clang code that isn't directly related to Checked C functionality. The only other option I see would be to essentially fork the entire Rewriter code into 3C.

Any thoughts on which approach is better? If I don't get any feedback, I'll plan to try the new one first. Migrating all of 3C to buffer rewrites might become a somewhat invasive change, but I can start by implementing it only for multi-level struct de-nesting and see how that goes.

@kyleheadley
Copy link
Member

I'd prefer to avoid the temporary files if possible.

I don't know if you checked, but can you add text multiple times at the same location to get multiple lines added?
You could determine the order first (or use a recursive breadth-first search), and add each unit to the same place to avoid getting rewritten text. Just my first thought.

@mattmccutchen-cci
Copy link
Member Author

I'd prefer to avoid the temporary files if possible.

You mean you prefer the second approach? Great, then I will try it.

I don't know if you checked, but can you add text multiple times at the same location to get multiple lines added?

Yes, that works: I have seen it happen in my tests so far on libarchive.

You could determine the order first (or use a recursive breadth-first search), and add each unit to the same place to avoid getting rewritten text. Just my first thought.

I don't understand how this relates to the problem I described. In the example:

struct A {
  struct B {
    struct C { ... } c;
  } b;
};

3C would generate two rewrites:

  1. struct B { ... } should be moved before struct A and replaced with struct B.
  2. struct C { ... } should be moved before struct A and replaced with struct C.

Maybe your point was that we need to ensure that struct C ends up before struct B in the file so that C is complete at the point it is used as the type of a field of B? This is an important point that I hadn't thought about. However, my focus here was on how to ensure that the version of struct B { ... } that we move takes into account the rewrite to the struct C { ... } inside. That is, we want to end up with the following:

struct C { ... };
struct B {
  struct C c;
};
struct A {
  struct B b;
};

not the following, which is what we would get if we copied the content of the source range for B without rewrite (2) having been applied:

struct C { ... };
struct B {
  struct C { ... } c;
};
struct A {
  struct B b;
};

Either the AST re-parsing approach or the approach that ensures rewrite (2) is performed first and then uses Rewriter::getRewrittenText to get the new version of B would solve this problem.

mattmccutchen-cci added a commit that referenced this issue Jul 21, 2021
multi-decl rewriting splits them rather than losing them.

Fixes #531
except for a "declaration does not declare anything" compiler warning
that will be addressed by de-nesting the inline structs.
@mattmccutchen-cci mattmccutchen-cci linked a pull request Jul 29, 2021 that will close this issue
11 tasks
mattmccutchen-cci added a commit that referenced this issue Sep 1, 2021
multi-decl rewriting splits them rather than losing them.

Fixes #531
except for a "declaration does not declare anything" compiler warning
that will be addressed by de-nesting the inline structs.
mattmccutchen-cci added a commit that referenced this issue Sep 17, 2021
multi-decl rewriting splits them rather than losing them.

Fixes #531
except for a "declaration does not declare anything" compiler warning
that will be addressed by de-nesting the inline structs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmark failure A bug causing a failure in our nightly benchmark tests bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants