Improvements to C frontend#2811
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves the C/C++ frontend’s parsing and symbol resolution in edge cases involving unary pointer operators, and introduces an opt-in mechanism to inject compiler built-ins to avoid bogus inferred types.
Changes:
- Fix unary
&/*handling to avoid leaking operator names intoPointerReference/PointerDereferenceand to prevent duplicate subtrees. - Add a regression test + C fixture covering the
*_p++ = _cpattern and ensuring no phantom"++"variable is inferred. - Add an opt-in
TranslationConfiguration.injectCompilerBuiltinsflag and wire it into CDT scanner setup (plus new test resources intended for system-include scenarios).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| cpg-language-cxx/src/main/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/ExpressionHandler.kt | Adjusts unary &/* creation to reuse the handled operand and only derive names from real Reference operands. |
| cpg-language-cxx/src/main/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/CXXLanguageFrontend.kt | Switches to ExtendedScannerInfo and adds optional include-file injection for compiler built-ins via a temp-extracted header. |
| cpg-core/src/main/kotlin/de/fraunhofer/aisec/cpg/TranslationConfiguration.kt | Adds injectCompilerBuiltins configuration flag with builder support and documentation. |
| cpg-language-cxx/src/test/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/CXXExpressionTest.kt | Adds a regression test ensuring *_p++ doesn’t produce a phantom global "++" variable and preserves naming for &x. |
| cpg-language-cxx/src/test/resources/c/deref_postincrement.c | Adds a C regression fixture for the unary-operator/name-leak bug. |
| cpg-language-cxx/src/test/resources/c/system_include/hello.c | Adds a simple C source fixture under system_include/. |
| cpg-language-cxx/src/test/resources/c/system_include/clang_predefines.txt | Adds a large clang predefines fixture under system_include/. |
Follow-up to "Improvements to C frontend" that removes the last sources of noise when the C frontend pulls in a full system-header chain (verified against the macOS SDK's stdio.h in the new CSystemIncludeTest, which drops from 5 inferred declarations + 3 Inference ERRORs to a single legitimate opaque forward-decl and 0 error/warn logs). - SymbolResolver: skip empty-named references. PointerDereference / PointerReference deliberately carry no name when wrapping a non-Reference operand (e.g. `*(_p++)`); running symbol resolution on them was inferring phantom empty-named globals. - SymbolResolver.getConstructorDeclaration: only synthesize an inferred constructor when the language has classes. C's Construction expressions no longer produce `struct::struct` methods that no real callsite ever invokes. - DeclaratorHandler: same for the implicit default constructor added when a composite type has no explicit ctor — gate on HasClasses so plain C structs stay ctor-free. - TypeResolver.handleType: fix an operator-precedence bug where `A && B || C` admitted non-ObjectTypes with GUESSED origin into record inference, producing spurious "non-object type" errors. - tryRecordInference: early-return null on non-ObjectType inputs so opportunistic best-guess callers don't reach the defensive error path in inferRecordDeclaration. - Inference.inferRecordDeclaration: refuse to synthesize a Record for an empty-named ObjectType (typedef target of an anonymous struct that already got its name transferred), and demote the non-object branch to a debug log now that the caller-side guard makes it unreachable via normal flow. - CXXLanguageFrontend / TranslationConfiguration: opt-in `injectCompilerBuiltins` flag that prepends a cpg_builtins.h prelude via CDT's ExtendedScannerInfo, mirroring `clang -include`. The prelude teaches CDT about `__builtin_va_list` (typedef `char *` on arm64-apple-darwin per clang's own AST) so it resolves through _types.h without being synthesized as a bogus struct. Off by default because it adds one implicit declaration per TU. - CSystemIncludeTest: new @EnabledOnOs(OS.MAC) test that pulls in the full macOS SDK stdio.h chain with a checked-in clang_predefines.txt snapshot, and asserts that printf resolves to a real declaration, no references are left unresolved, and the only inferred declaration is `struct __sFILEX` (Apple's private ABI opaque forward-decl in _stdio.h whose body ships only inside libc). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Broadens the macOS SDK include chain the test exercises to also cover <stdlib.h>, <string.h> and <math.h>: hello.c now calls malloc/strlen/strcpy/free/sqrt in addition to printf, and each of those is asserted to resolve to a real, non-inferred declaration from its owning header. Still 0 unresolved refs and 0 error/warn log lines. - cpg_builtins.h: add clang implicit integer typedefs (__int128_t / __uint128_t, approximated as `long long` / `unsigned long long` since CDT does not understand the `__int128` keyword), the __builtin_bswap32/64 + __builtin_fabs*/inf* prototypes referenced by the math.h chain, and a lossy `_Float16` alias to `float` for the bounds-checking helpers. - clang_predefines.txt: drop `__BLOCKS__` — enabling it pulls in atexit_b / bsearch_b / qsort_b prototypes that use Apple's `^` closure syntax, which CDT can't parse. We lose visibility on the `_b`-suffixed functions in exchange for a clean parse. - CSystemIncludeTest: verify each expected system call (printf/malloc/free/strlen/strcpy/sqrt) resolves to a non-inferred declaration, and expand the inferred-decl allow-list with `_malloc_zone_t` (Apple's opaque libmalloc handle) plus the parameter-name-as-type CDT quirk we hit on radixsort/strtonum (__endbyte/__minval/__maxval). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The snapshot was captured from `clang -E -dM -x c /dev/null` on `arm64-apple-darwin`; the file's contents are only meaningful for that tuple. Reflect that in the path and gate: - Predefines file moves from `c/system_include/clang_predefines.txt` to `c/system_include/darwin-arm64/clang_predefines.txt`, leaving room for a sibling `darwin-x86_64/` or other combinations later. - `macOSSdkAvailable` now also requires the current JVM's os.arch to be Apple silicon (aarch64/arm64), so the test skips cleanly on an x86_64 Mac rather than feeding arm64 predefines into an effectively different ABI parse. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Turns the previously test-only prelude into a first-class opt-in
API surface, and moves the clang preprocessor predefines snapshot
out of the test tree into a per-(os, arch) resource bundle inside
the frontend so any consumer can enable it.
- API: `injectCompilerBuiltins(Boolean)` becomes
`injectCompilerBuiltins(target: String?)` — pass a target name
such as `"darwin-arm64"` (== the folder name under `builtins/`)
to enable, or `null` (default) to keep it off. An unknown target
logs a warning and skips injection instead of failing.
- Resource layout:
cpg-language-cxx/src/main/resources/.../frontends/cxx/builtins/
common.h # cross-target: __builtin_bswap*,
# __builtin_fabs*, __builtin_inf*,
# _Float16, __int128_t typedefs
darwin-arm64/cpg_builtins.h # all 478 clang predefines +
# `typedef char *__builtin_va_list`
# + `#include <common.h>`
The target file `#include <common.h>`s the shared pieces; the
frontend adds the extracted `builtins/` root to the include
paths so the angle-bracket lookup resolves.
- CXXLanguageFrontend: extracts the whole `builtins/` tree to a
temp directory once per JVM (deleted on exit) since CDT can only
open real files. The list of resources to extract lives next to
the extractor as a companion-object constant; adding a new
target (e.g. `linux-x86_64/cpg_builtins.h`) is a one-line append
plus the resource file.
- CSystemIncludeTest: drops the `.symbols(parseClangPredefines(...))`
plumbing and the `clang_predefines.txt` fixture — everything is
now inside the frontend's resource tree.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Clarify the comment above the `unresolved` filter: the SymbolResolver does dispatch PointerDereference / PointerReference to handleReference (they're Reference subclasses); the empty-name guard is what short- circuits the `*(_p++)` / `&(a + 1)` cases and leaves refersTo null. The filter exists so the assertion doesn't fire on that legitimate no-op. - Use `kotlin.test.Test` for consistency with the rest of the package (e.g. CXXExpressionTest). The Jupiter engine still picks up @EnabledOnOs / @EnabledIf. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| // We also reuse the already-computed `input` node — calling handle(ctx.operand) a | ||
| // second time creates a duplicate, orphaned subtree. | ||
| val operand = input | ||
| val refName = (operand as? Reference)?.name |
There was a problem hiding this comment.
What if operand is not a reference?
Why do we need the operand variable? Couldn't we directly use input?
| * typedefs / prototypes instead of being inferred as bogus structs. Adds a small number of | ||
| * implicit declarations per TU. | ||
| */ | ||
| fun injectCompilerBuiltins(target: String?): Builder { |
There was a problem hiding this comment.
To be honest, I don't like the idea of making the configuration of the CPG translation even more complicated. We MUST make this significantly easier. In my opinion, it's very difficult to correctly add some code/repository to the analysis without mistakes with the current interface.
Tighten the C frontend's precision when a translation unit pulls in a full system-header chain (verified against the macOS SDK's
stdio.h,stdlib.h,string.h,math.h).Frontend / resolver fixes
ExpressionHandler: for unary*/&, stop leaking the operand's name into the resultingPointerDereference/PointerReferencewhen the operand isn't itself aReference(*(_p++)no longer becomes aphantom
Variable "++"), and reuse the already-handled operand instead of running the handler twice.SymbolResolver: skip empty-named references (they're*(complex_expr)/&(complex_expr)cases with no lookup target) so we don't infer phantom global variables; and only synthesize inferred constructors whenthe language actually has classes.
DeclaratorHandler: stop adding an implicit default constructor to every C struct — plain C has nostruct::structmethod.TypeResolver: fix an operator-precedence bug (A && B || C→A && (B || C)) that let non-ObjectTypes (e.g.FunctionPointerType) reach record inference and log spurious errors.tryRecordInference/Inference.inferRecordDeclaration: early-return on non-ObjectTypeand empty-namedObjectType; demote the previous ERROR log to debug.New opt-in API: compiler-builtins prelude
TranslationConfiguration.injectCompilerBuiltins(target: String?)— pass a target name (e.g."darwin-arm64") to prepend a compiler-intrinsic prelude to every TU (CDTExtendedScannerInfoinclude-files,equivalent to
clang -include). Unknown targets warn and skip.cpg-language-cxx/src/main/resources/.../frontends/cxx/builtins/:common.h— cross-target clang builtins (__builtin_bswap*,__builtin_fabs*,__builtin_inf*,_Float16,__int128_t).darwin-arm64/cpg_builtins.h— all 478 clang preprocessor predefines + arm64__builtin_va_listtypedef +#include <common.h>.BUILTIN_RESOURCESplus the new resource file.Guardrail test
CSystemIncludeTest(@EnabledOnOs(OS.MAC)+ arm64 SDK check): parses ahello.cthat exercisesprintf/malloc/free/strlen/strcpy/sqrtagainst the real macOS SDK include chain, then asserts each ofthose resolves to a real (non-inferred) declaration, no references are left unresolved, and the only inferred declarations are the expected opaque forward-decls (
__sFILEX,_malloc_zone_t) plus three documentedCDT parser quirks.
SymbolResolverwarns, 0Inferenceerrors on the expanded include chain (was 3 errors + 5 phantom decls before).