Skip to content

Improvements to C frontend#2811

Open
oxisto wants to merge 6 commits into
mainfrom
oxisto/c-fixes
Open

Improvements to C frontend#2811
oxisto wants to merge 6 commits into
mainfrom
oxisto/c-fixes

Conversation

@oxisto

@oxisto oxisto commented Jul 2, 2026

Copy link
Copy Markdown
Member

Tighten the C frontend's precision when a translation unit pulls in a full system-header chain (verified against the macOS SDK's stdio.h, stdlib.h, string.h, math.h).

Frontend / resolver fixes

  • ExpressionHandler: for unary * / &, stop leaking the operand's name into the resulting PointerDereference / PointerReference when the operand isn't itself a Reference (*(_p++) no longer becomes a
    phantom Variable "++"), and reuse the already-handled operand instead of running the handler twice.
  • SymbolResolver: skip empty-named references (they're *(complex_expr) / &(complex_expr) cases with no lookup target) so we don't infer phantom global variables; and only synthesize inferred constructors when
    the language actually has classes.
  • DeclaratorHandler: stop adding an implicit default constructor to every C struct — plain C has no struct::struct method.
  • TypeResolver: fix an operator-precedence bug (A && B || CA && (B || C)) that let non-ObjectTypes (e.g. FunctionPointerType) reach record inference and log spurious errors.
  • tryRecordInference / Inference.inferRecordDeclaration: early-return on non-ObjectType and empty-named ObjectType; demote the previous ERROR log to debug.

New opt-in API: compiler-builtins prelude

  • TranslationConfiguration.injectCompilerBuiltins(target: String?) — pass a target name (e.g. "darwin-arm64") to prepend a compiler-intrinsic prelude to every TU (CDT ExtendedScannerInfo include-files,
    equivalent to clang -include). Unknown targets warn and skip.
  • Resource layout under cpg-language-cxx/src/main/resources/.../frontends/cxx/builtins/:
    • common.h — cross-target clang builtins (__builtin_bswap*, __builtin_fabs*, __builtin_inf*, _Float16, __int128_t).
    • darwin-arm64/cpg_builtins.h — all 478 clang preprocessor predefines + arm64 __builtin_va_list typedef + #include <common.h>.
  • Adding another target is a one-line append to BUILTIN_RESOURCES plus the new resource file.

Guardrail test

  • CSystemIncludeTest (@EnabledOnOs(OS.MAC) + arm64 SDK check): parses a hello.c that exercises printf/malloc/free/strlen/strcpy/sqrt against the real macOS SDK include chain, then asserts each of
    those resolves to a real (non-inferred) declaration, no references are left unresolved, and the only inferred declarations are the expected opaque forward-decls (__sFILEX, _malloc_zone_t) plus three documented
    CDT parser quirks.
  • Result: 0 SymbolResolver warns, 0 Inference errors on the expanded include chain (was 3 errors + 5 phantom decls before).

Copilot AI review requested due to automatic review settings July 2, 2026 13:36
@oxisto oxisto marked this pull request as draft July 2, 2026 13:36
@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 53.33333% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.14%. Comparing base (de5e7b3) to head (411eb9b).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...fer/aisec/cpg/frontends/cxx/CXXLanguageFrontend.kt 32.25% 20 Missing and 1 partial ⚠️
...e/fraunhofer/aisec/cpg/TranslationConfiguration.kt 50.00% 2 Missing ⚠️
...fraunhofer/aisec/cpg/passes/inference/Inference.kt 50.00% 2 Missing ⚠️
...hofer/aisec/cpg/frontends/cxx/ExpressionHandler.kt 80.00% 0 Missing and 2 partials ⚠️
...n/de/fraunhofer/aisec/cpg/passes/SymbolResolver.kt 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
...lin/de/fraunhofer/aisec/cpg/passes/TypeResolver.kt 90.00% <100.00%> (+0.14%) ⬆️
...raunhofer/aisec/cpg/passes/inference/PassHelper.kt 84.32% <100.00%> (+0.11%) ⬆️
...hofer/aisec/cpg/frontends/cxx/DeclaratorHandler.kt 81.40% <100.00%> (ø)
...n/de/fraunhofer/aisec/cpg/passes/SymbolResolver.kt 86.85% <83.33%> (-0.24%) ⬇️
...e/fraunhofer/aisec/cpg/TranslationConfiguration.kt 91.32% <50.00%> (-0.70%) ⬇️
...fraunhofer/aisec/cpg/passes/inference/Inference.kt 76.16% <50.00%> (-1.73%) ⬇️
...hofer/aisec/cpg/frontends/cxx/ExpressionHandler.kt 72.92% <80.00%> (+0.26%) ⬆️
...fer/aisec/cpg/frontends/cxx/CXXLanguageFrontend.kt 73.51% <32.25%> (-3.50%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the C/C++ frontend’s parsing and symbol resolution in edge cases involving unary pointer operators, and introduces an opt-in mechanism to inject compiler built-ins to avoid bogus inferred types.

Changes:

  • Fix unary & / * handling to avoid leaking operator names into PointerReference/PointerDereference and to prevent duplicate subtrees.
  • Add a regression test + C fixture covering the *_p++ = _c pattern and ensuring no phantom "++" variable is inferred.
  • Add an opt-in TranslationConfiguration.injectCompilerBuiltins flag and wire it into CDT scanner setup (plus new test resources intended for system-include scenarios).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cpg-language-cxx/src/main/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/ExpressionHandler.kt Adjusts unary &/* creation to reuse the handled operand and only derive names from real Reference operands.
cpg-language-cxx/src/main/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/CXXLanguageFrontend.kt Switches to ExtendedScannerInfo and adds optional include-file injection for compiler built-ins via a temp-extracted header.
cpg-core/src/main/kotlin/de/fraunhofer/aisec/cpg/TranslationConfiguration.kt Adds injectCompilerBuiltins configuration flag with builder support and documentation.
cpg-language-cxx/src/test/kotlin/de/fraunhofer/aisec/cpg/frontends/cxx/CXXExpressionTest.kt Adds a regression test ensuring *_p++ doesn’t produce a phantom global "++" variable and preserves naming for &x.
cpg-language-cxx/src/test/resources/c/deref_postincrement.c Adds a C regression fixture for the unary-operator/name-leak bug.
cpg-language-cxx/src/test/resources/c/system_include/hello.c Adds a simple C source fixture under system_include/.
cpg-language-cxx/src/test/resources/c/system_include/clang_predefines.txt Adds a large clang predefines fixture under system_include/.

Comment thread cpg-language-cxx/src/test/resources/c/system_include/hello.c
oxisto and others added 4 commits July 2, 2026 16:16
Follow-up to "Improvements to C frontend" that removes the last
sources of noise when the C frontend pulls in a full system-header
chain (verified against the macOS SDK's stdio.h in the new
CSystemIncludeTest, which drops from 5 inferred declarations + 3
Inference ERRORs to a single legitimate opaque forward-decl and 0
error/warn logs).

- SymbolResolver: skip empty-named references. PointerDereference /
  PointerReference deliberately carry no name when wrapping a
  non-Reference operand (e.g. `*(_p++)`); running symbol resolution
  on them was inferring phantom empty-named globals.
- SymbolResolver.getConstructorDeclaration: only synthesize an
  inferred constructor when the language has classes. C's
  Construction expressions no longer produce `struct::struct` methods
  that no real callsite ever invokes.
- DeclaratorHandler: same for the implicit default constructor added
  when a composite type has no explicit ctor — gate on HasClasses so
  plain C structs stay ctor-free.
- TypeResolver.handleType: fix an operator-precedence bug where
  `A && B || C` admitted non-ObjectTypes with GUESSED origin into
  record inference, producing spurious "non-object type" errors.
- tryRecordInference: early-return null on non-ObjectType inputs so
  opportunistic best-guess callers don't reach the defensive error
  path in inferRecordDeclaration.
- Inference.inferRecordDeclaration: refuse to synthesize a Record for
  an empty-named ObjectType (typedef target of an anonymous struct
  that already got its name transferred), and demote the non-object
  branch to a debug log now that the caller-side guard makes it
  unreachable via normal flow.
- CXXLanguageFrontend / TranslationConfiguration: opt-in
  `injectCompilerBuiltins` flag that prepends a cpg_builtins.h
  prelude via CDT's ExtendedScannerInfo, mirroring
  `clang -include`. The prelude teaches CDT about
  `__builtin_va_list` (typedef `char *` on arm64-apple-darwin per
  clang's own AST) so it resolves through _types.h without being
  synthesized as a bogus struct. Off by default because it adds one
  implicit declaration per TU.
- CSystemIncludeTest: new @EnabledOnOs(OS.MAC) test that pulls in the
  full macOS SDK stdio.h chain with a checked-in
  clang_predefines.txt snapshot, and asserts that printf resolves to
  a real declaration, no references are left unresolved, and the
  only inferred declaration is `struct __sFILEX` (Apple's private
  ABI opaque forward-decl in _stdio.h whose body ships only inside
  libc).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Broadens the macOS SDK include chain the test exercises to also
cover <stdlib.h>, <string.h> and <math.h>: hello.c now calls
malloc/strlen/strcpy/free/sqrt in addition to printf, and each of
those is asserted to resolve to a real, non-inferred declaration
from its owning header. Still 0 unresolved refs and 0 error/warn
log lines.

- cpg_builtins.h: add clang implicit integer typedefs (__int128_t /
  __uint128_t, approximated as `long long` / `unsigned long long`
  since CDT does not understand the `__int128` keyword), the
  __builtin_bswap32/64 + __builtin_fabs*/inf* prototypes referenced
  by the math.h chain, and a lossy `_Float16` alias to `float` for
  the bounds-checking helpers.
- clang_predefines.txt: drop `__BLOCKS__` — enabling it pulls in
  atexit_b / bsearch_b / qsort_b prototypes that use Apple's `^`
  closure syntax, which CDT can't parse. We lose visibility on the
  `_b`-suffixed functions in exchange for a clean parse.
- CSystemIncludeTest: verify each expected system call
  (printf/malloc/free/strlen/strcpy/sqrt) resolves to a
  non-inferred declaration, and expand the inferred-decl allow-list
  with `_malloc_zone_t` (Apple's opaque libmalloc handle) plus the
  parameter-name-as-type CDT quirk we hit on radixsort/strtonum
  (__endbyte/__minval/__maxval).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The snapshot was captured from `clang -E -dM -x c /dev/null` on
`arm64-apple-darwin`; the file's contents are only meaningful for
that tuple. Reflect that in the path and gate:

- Predefines file moves from
  `c/system_include/clang_predefines.txt` to
  `c/system_include/darwin-arm64/clang_predefines.txt`, leaving
  room for a sibling `darwin-x86_64/` or other combinations later.
- `macOSSdkAvailable` now also requires the current JVM's os.arch
  to be Apple silicon (aarch64/arm64), so the test skips cleanly on
  an x86_64 Mac rather than feeding arm64 predefines into an
  effectively different ABI parse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Turns the previously test-only prelude into a first-class opt-in
API surface, and moves the clang preprocessor predefines snapshot
out of the test tree into a per-(os, arch) resource bundle inside
the frontend so any consumer can enable it.

- API: `injectCompilerBuiltins(Boolean)` becomes
  `injectCompilerBuiltins(target: String?)` — pass a target name
  such as `"darwin-arm64"` (== the folder name under `builtins/`)
  to enable, or `null` (default) to keep it off. An unknown target
  logs a warning and skips injection instead of failing.
- Resource layout:
    cpg-language-cxx/src/main/resources/.../frontends/cxx/builtins/
      common.h                         # cross-target: __builtin_bswap*,
                                       # __builtin_fabs*, __builtin_inf*,
                                       # _Float16, __int128_t typedefs
      darwin-arm64/cpg_builtins.h      # all 478 clang predefines +
                                       # `typedef char *__builtin_va_list`
                                       # + `#include <common.h>`
  The target file `#include <common.h>`s the shared pieces; the
  frontend adds the extracted `builtins/` root to the include
  paths so the angle-bracket lookup resolves.
- CXXLanguageFrontend: extracts the whole `builtins/` tree to a
  temp directory once per JVM (deleted on exit) since CDT can only
  open real files. The list of resources to extract lives next to
  the extractor as a companion-object constant; adding a new
  target (e.g. `linux-x86_64/cpg_builtins.h`) is a one-line append
  plus the resource file.
- CSystemIncludeTest: drops the `.symbols(parseClangPredefines(...))`
  plumbing and the `clang_predefines.txt` fixture — everything is
  now inside the frontend's resource tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@oxisto oxisto marked this pull request as ready for review July 3, 2026 07:02
@oxisto oxisto requested a review from Copilot July 3, 2026 07:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

- Clarify the comment above the `unresolved` filter: the SymbolResolver
  does dispatch PointerDereference / PointerReference to handleReference
  (they're Reference subclasses); the empty-name guard is what short-
  circuits the `*(_p++)` / `&(a + 1)` cases and leaves refersTo null.
  The filter exists so the assertion doesn't fire on that legitimate
  no-op.
- Use `kotlin.test.Test` for consistency with the rest of the package
  (e.g. CXXExpressionTest). The Jupiter engine still picks up
  @EnabledOnOs / @EnabledIf.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
// We also reuse the already-computed `input` node — calling handle(ctx.operand) a
// second time creates a duplicate, orphaned subtree.
val operand = input
val refName = (operand as? Reference)?.name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if operand is not a reference?

Why do we need the operand variable? Couldn't we directly use input?

* typedefs / prototypes instead of being inferred as bogus structs. Adds a small number of
* implicit declarations per TU.
*/
fun injectCompilerBuiltins(target: String?): Builder {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't like the idea of making the configuration of the CPG translation even more complicated. We MUST make this significantly easier. In my opinion, it's very difficult to correctly add some code/repository to the analysis without mistakes with the current interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants