Skip to content

Add MANIFEST.MF fallback parser for JARs without pom.properties#150

Merged
anderruiz merged 3 commits into
mainfrom
ander/jar-manifest-fallback-parser
May 14, 2026
Merged

Add MANIFEST.MF fallback parser for JARs without pom.properties#150
anderruiz merged 3 commits into
mainfrom
ander/jar-manifest-fallback-parser

Conversation

@anderruiz
Copy link
Copy Markdown
Contributor

@anderruiz anderruiz commented May 13, 2026

Summary

  • Adds a MANIFEST.MF fallback path to JarPomPropertiesExtractor for JAR files that lack META-INF/maven/*/pom.properties (e.g. BouncyCastle bcprov-jdk15to18-1.78.1.jar)
  • Implements the Java Tracer's Dependency.java guessFallbackNoPom algorithm: groupId inference via dot-prefix heuristic on Bundle-SymbolicName, version priority (agreement > filename > Bundle-Version > Implementation-Version)
  • artifactId always from filename: Bundle-Name and Implementation-Title are OSGi/JAR display names and are not reliable Maven artifact IDs (e.g. Bundle-Name: bcprov vs filename bcprov-jdk18on); the filename-derived artifact ID is always used
  • Classifier stripping: filenames such as netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar now yield version 2.0.61.Final instead of including the classifier; known OS/arch (linux, windows, osx, …) and descriptor (sources, javadoc, native, …) classifiers are stripped; Maven qualifiers (-SNAPSHOT, -Final, -RC*) are left untouched
  • Greedy filename regex: uses rightmost -\d boundary so log4j-1.2-api-2.17.1.jar splits as artifactId=log4j-1.2-api, version=2.17.1
  • Automatic-Module-Name fallback for groupId: when Bundle-SymbolicName has no dots (e.g. bcprov), the BSN dot-prefix heuristic cannot infer a groupId; parseGroupID falls back to Automatic-Module-Name, first applying the same dot-prefix heuristic, then stripping the last dot-segment when AMN has >= 2 dots (e.g. org.bouncycastle.provider -> org.bouncycastle)
  • Manifest parsing stops at the first blank line (main-section boundary) so per-entry section headers cannot overwrite main-section values
  • parseManifestAttributes returns a manifestAttrs struct instead of multiple return values
  • Includes 6 new JAR test fixtures and full unit + integration test coverage (50+ new test cases)

Algorithm Details

groupId inference (parseGroupID):

  1. Strips OSGi directives from Bundle-SymbolicName (everything after ;)
  2. Tries candidates [filenameArtifact, cleanName(bundleName)] against BSN dot-prefix heuristic
  3. If BSN has no dots -> tries same heuristic on Automatic-Module-Name, then strips AMN last segment (requires >= 2 dots)
  4. Fallback: groupId = entire BSN

artifactId: always the filename-derived artifact ID; Bundle-Name and Implementation-Title are only used as secondary candidates for groupId inference, never as the artifact ID

Version priority: Bundle-Version == Implementation-Version (high confidence) -> filename version -> Bundle-Version -> Implementation-Version

Filename regex: ^(.+)-(\d.*)\.jar$ (greedy, rightmost split) + classifier stripping via jarClassifierRegex

Real-world result for bcprov-jdk15to18-1.78.1.jar:

  • Before: bcprov:bcprov@1.78.1
  • After: org.bouncycastle:bcprov-jdk15to18@1.78.1

Test plan

  • All existing JAR pom.properties tests pass unchanged (no regression)
  • Integration tests: bouncycastle jdk15to18, singleton directive, version priority, version agreement, no-BSN, pom.properties precedence
  • Unit tests: parseJarFilename (incl. classifier cases), parseManifestAttributes (incl. AMN + per-entry isolation), parseGroupID (incl. AMN fallback), resolveManifestPackage (incl. artifactId always from filename)
  • Full test suite: ./scripts/run_tests.sh passes
  • Lints: ./scripts/run_lints.sh passes

Generated with Claude Code

When a JAR file has no embedded pom.properties, fall back to inferring
Maven coordinates from MANIFEST.MF attributes and the JAR filename.
This replicates the Java Tracer's Dependency.java guessFallbackNoPom
algorithm for groupId inference using the dot-prefix heuristic on
Bundle-SymbolicName, with artifactId and version priority chains.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@anderruiz anderruiz requested a review from a team as a code owner May 13, 2026 15:09
@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 13, 2026

🎯 Code Coverage (details)
Patch Coverage: 87.39%
Overall Coverage: 84.98% (+0.03%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 7ad4cc9 | Docs | Datadog PR Page | Give us feedback!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66e2afd451

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +103 to +107
key, value, found := strings.Cut(line, ": ")
if !found {
currentKey = ""

continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop parsing manifest attributes at the main section

For JARs whose MANIFEST.MF contains per-entry sections after the blank line that ends the main section, this continues parsing those later headers and can overwrite the package-level values collected above. Because the fallback uses these attributes to infer the Maven name/version, a no-pom JAR with class/file-specific Implementation-Title, Implementation-Version, or OSGi headers can be reported as the wrong package; the parser should stop at the first blank line in the manifest main section.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4e7e90d. The scanner now breaks on the first blank line, which is the JAR spec's main-section terminator. Added a regression test: a manifest with per-entry sections containing a different Bundle-SymbolicName and Implementation-Version — the parser correctly returns only the main-section values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4e7e90d. The scanner now breaks on the first blank line, which is the JAR spec's main-section terminator. Added a regression test: a manifest with per-entry sections containing a different Bundle-SymbolicName and Implementation-Version — the parser correctly returns only the main-section values.

Per the JAR spec, a blank line ends the main section of MANIFEST.MF.
Per-entry sections follow and can contain class/file-specific headers
that share attribute names with the main section (e.g. Implementation-Title,
Bundle-SymbolicName). The previous scanner consumed all lines without
stopping, so per-entry headers could overwrite main-section values and
cause a JAR to be reported under the wrong package identity.

Add a break on the first blank line so only main-section attributes
are used for package inference. Add a regression test covering this case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e7e90d327

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pkg/lockfile/java/jar-manifest.go Outdated
// jarFilenameRegex matches Maven-convention JAR filenames: artifactId-version.jar
// The version must start with a digit. The artifactId is the minimal non-greedy match,
// and the version captures everything after the first hyphen-digit boundary through .jar.
var jarFilenameRegex = cachedregexp.MustCompile(`^(.+?)-(\d.*)\.jar$`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip Maven classifiers from fallback versions

When scanning a no-pom JAR whose filename includes a Maven classifier, such as netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar, this regex captures 2.0.61.Final-linux-x86_64 as the package version. Because resolveManifestPackage prefers the filename-derived version whenever the manifest versions do not both agree, the fallback can emit an SBOM component with a non-existent Maven version even when Bundle-Version contains the real version; classified artifacts are already present in this repo's Maven/Gradle fixtures, so the filename parser should recognize and drop the classifier portion rather than treating it as part of the version.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db01517de6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pkg/lockfile/java/jar-manifest.go Outdated
// jarFilenameRegex matches Maven-convention JAR filenames: artifactId-version.jar
// The version must start with a digit. The artifactId is the minimal non-greedy match,
// and the version captures everything after the first hyphen-digit boundary through .jar.
var jarFilenameRegex = cachedregexp.MustCompile(`^(.+?)-(\d.*)\.jar$`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid splitting artifact IDs at the first digit segment

For no-pom JARs whose artifactId itself contains a hyphen followed by a digit, such as log4j-1.2-api-2.17.1.jar, this non-greedy regex parses log4j as the artifact and 1.2-api-2.17.1 as the version. That filename-derived version is later preferred whenever manifest versions are missing or disagree, and the artifact value is also used for group/name inference, so the fallback can emit a non-existent Maven coordinate for a valid Maven filename. Consider deriving the filename split from the manifest version when available or otherwise testing version boundaries from the right instead of the first -\d boundary.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 75da718 by changing the regex from non-greedy (.+?) to greedy (.+). The greedy match finds the rightmost -\d boundary rather than the leftmost, so log4j-1.2-api-2.17.1.jar now splits as log4j-1.2-api + 2.17.1. Added a test case to cover this pattern.

@anderruiz anderruiz force-pushed the ander/jar-manifest-fallback-parser branch from db01517 to 75da718 Compare May 13, 2026 15:51
@anderruiz
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75da7188fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pkg/lockfile/java/jar-manifest.go Outdated
Comment on lines +247 to +248
case cleanedBundleName != "" && cleanedBundleName != bsn:
artifactID = cleanedBundleName
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the filename artifactId for manifest fallback

For JARs whose manifest Bundle-Name is only a display/generic name, this branch emits the wrong Maven coordinate by replacing the filename artifactId. The new bcprov-jdk18on-1.78.1.jar fallback fixture itself demonstrates the problem: Maven Central identifies it as org.bouncycastle:bcprov-jdk18on:1.78.1, but this code returns org.bouncycastle:bcprov:1.78.1, so SBOM/PURL vulnerability matching will look up a different artifact. This applies whenever Bundle-Name is shorter or human-readable while the JAR filename carries the actual Maven artifactId.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4f124c4. resolveManifestPackage now always uses filenameArtifact as the artifactId. Bundle-Name and Implementation-Title are OSGi/JAR display names — they're still passed to parseGroupID for groupId inference but are no longer candidates for the artifactId. The bcprov-jdk18on-1.78.1.jar fixture now produces org.bouncycastle:bcprov-jdk18on@1.78.1.

@anderruiz anderruiz force-pushed the ander/jar-manifest-fallback-parser branch from 75da718 to 4f124c4 Compare May 13, 2026 17:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f124c4eb6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// the split occurs at the LAST hyphen-digit boundary, not the first. This correctly
// handles artifactIds that themselves contain a hyphen-digit segment, e.g.
// log4j-1.2-api-2.17.1.jar → artifactId="log4j-1.2-api", version="2.17.1".
var jarFilenameRegex = cachedregexp.MustCompile(`^(.+)-(\d.*)\.jar$`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split snapshot versions before the timestamp suffix

For no-pom JARs whose filename version itself contains a hyphen followed by a digit, such as Maven's timestamped snapshots (my-lib-1.0-20260513.123456-1.jar) or other hyphenated numeric versions, this greedy match treats the last -\d as the artifact/version boundary. That yields filenameArtifact="my-lib-1.0-20260513.123456" and filenameVersion="1", so the manifest fallback can emit a non-existent coordinate even when the manifest has the correct package metadata. The split needs to account for numeric hyphens inside valid versions, ideally by using the manifest version when available or validating candidate splits from the right.

Useful? React with 👍 / 👎.

@anderruiz anderruiz force-pushed the ander/jar-manifest-fallback-parser branch from 4f124c4 to be88ee0 Compare May 13, 2026 20:05
Copy link
Copy Markdown
Contributor

@rjcoulter22 rjcoulter22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly wondering about this https://github.com/DataDog/datadog-sbom-generator/pull/150/changes#r3237264109 - I might be missing something though so let me know 👍

Comment thread pkg/lockfile/java/jar-manifest.go Outdated
// Parsing stops at the first blank line, which ends the main section; per-entry
// sections that follow must not overwrite main-section values.
// Returns raw attribute values (not cleaned) so callers can apply appropriate cleaning.
func parseManifestAttributes(zipReader *zip.Reader) (bundleSymbolicName, bundleName, bundleVersion, implTitle, implVersion, automaticModuleName string, err error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we return a struct here? It might make it a bit easier to unpack results/make any future modifcations

Comment thread pkg/lockfile/java/jar-manifest.go Outdated
return packages
}

bsn, bundleName, bundleVersion, _, implVersion, automaticModuleName, err := parseManifestAttributes(zipReader)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not using that unpacked value can we just remove it from returning from this method? So something like

bsn, bundleName, bundleVersion, implVersion, automaticModuleName, err := parseManifestAttributes(zipReader)

filenameVersion: "1.78.1",
bsn: "bcprov",
bundleName: "bcprov",
bundleVersion: "1..78.1",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Version type? Should this be 1.78.1?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not 😅 it replicates exactly the contents in MANIFEST.MF
Captura de pantalla 2026-05-13 a las 23 02 43


// artifactId comes from the filename: it is the most reliable source of the Maven
// artifact ID. Bundle-Name / Implementation-Title are display names only.
artifactID := filenameArtifact
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we unconditionally use the file name but the PR description says artifactId priority (Bundle-Name > Implementation-Title > filename), is the description just out of date?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original plan, but it was not resulting in the expected result for BC, so the file name took preference when the bundle name was not correct, let me update the description. I also sent the improvements to dd-trace-java so that they could implement them in the Java Tracer

Three improvements to the MANIFEST.MF fallback parser:

1. **Classifier stripping** (Codex P2): JAR filenames such as
   netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar were
   capturing the entire suffix (2.0.61.Final-linux-x86_64) as the version.
   jarClassifierRegex now strips known OS/arch classifiers (linux, windows,
   osx, …) and common descriptor classifiers (sources, javadoc, native, …)
   before returning the version.

2. **Automatic-Module-Name for groupId** (BouncyCastle): when
   Bundle-SymbolicName has no dots (e.g. "bcprov"), the BSN dot-prefix
   heuristic cannot infer a groupId. parseGroupID now falls back to
   Automatic-Module-Name: first applies the same dot-prefix heuristic, then
   as a last resort strips the final segment of AMN when it has ≥ 2 dots
   (e.g. "org.bouncycastle.provider" → "org.bouncycastle").

3. **Filename artifact preference when Bundle-Name == BSN**: when Bundle-Name
   and BSN are identical after cleaning (both "bcprov"), Bundle-Name carries
   no additional information and is skipped so the more-specific
   filename-derived artifactId (e.g. "bcprov-jdk15to18") is used instead.

Together these produce org.bouncycastle:bcprov-jdk15to18@1.78.1 from the
real bcprov-jdk15to18-1.78.1.jar rather than bcprov:bcprov@1.78.1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@anderruiz anderruiz force-pushed the ander/jar-manifest-fallback-parser branch from be88ee0 to 7ad4cc9 Compare May 13, 2026 21:10
@anderruiz anderruiz merged commit bfdde38 into main May 14, 2026
11 checks passed
@anderruiz anderruiz deleted the ander/jar-manifest-fallback-parser branch May 14, 2026 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants