Add MANIFEST.MF fallback parser for JARs without pom.properties#150
Conversation
When a JAR file has no embedded pom.properties, fall back to inferring Maven coordinates from MANIFEST.MF attributes and the JAR filename. This replicates the Java Tracer's Dependency.java guessFallbackNoPom algorithm for groupId inference using the dot-prefix heuristic on Bundle-SymbolicName, with artifactId and version priority chains. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
🎯 Code Coverage (details) 🔗 Commit SHA: 7ad4cc9 | Docs | Datadog PR Page | Give us feedback! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66e2afd451
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| key, value, found := strings.Cut(line, ": ") | ||
| if !found { | ||
| currentKey = "" | ||
|
|
||
| continue |
There was a problem hiding this comment.
Stop parsing manifest attributes at the main section
For JARs whose MANIFEST.MF contains per-entry sections after the blank line that ends the main section, this continues parsing those later headers and can overwrite the package-level values collected above. Because the fallback uses these attributes to infer the Maven name/version, a no-pom JAR with class/file-specific Implementation-Title, Implementation-Version, or OSGi headers can be reported as the wrong package; the parser should stop at the first blank line in the manifest main section.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4e7e90d. The scanner now breaks on the first blank line, which is the JAR spec's main-section terminator. Added a regression test: a manifest with per-entry sections containing a different Bundle-SymbolicName and Implementation-Version — the parser correctly returns only the main-section values.
There was a problem hiding this comment.
Fixed in 4e7e90d. The scanner now breaks on the first blank line, which is the JAR spec's main-section terminator. Added a regression test: a manifest with per-entry sections containing a different Bundle-SymbolicName and Implementation-Version — the parser correctly returns only the main-section values.
Per the JAR spec, a blank line ends the main section of MANIFEST.MF. Per-entry sections follow and can contain class/file-specific headers that share attribute names with the main section (e.g. Implementation-Title, Bundle-SymbolicName). The previous scanner consumed all lines without stopping, so per-entry headers could overwrite main-section values and cause a JAR to be reported under the wrong package identity. Add a break on the first blank line so only main-section attributes are used for package inference. Add a regression test covering this case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e7e90d327
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // jarFilenameRegex matches Maven-convention JAR filenames: artifactId-version.jar | ||
| // The version must start with a digit. The artifactId is the minimal non-greedy match, | ||
| // and the version captures everything after the first hyphen-digit boundary through .jar. | ||
| var jarFilenameRegex = cachedregexp.MustCompile(`^(.+?)-(\d.*)\.jar$`) |
There was a problem hiding this comment.
Strip Maven classifiers from fallback versions
When scanning a no-pom JAR whose filename includes a Maven classifier, such as netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar, this regex captures 2.0.61.Final-linux-x86_64 as the package version. Because resolveManifestPackage prefers the filename-derived version whenever the manifest versions do not both agree, the fallback can emit an SBOM component with a non-existent Maven version even when Bundle-Version contains the real version; classified artifacts are already present in this repo's Maven/Gradle fixtures, so the filename parser should recognize and drop the classifier portion rather than treating it as part of the version.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: db01517de6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // jarFilenameRegex matches Maven-convention JAR filenames: artifactId-version.jar | ||
| // The version must start with a digit. The artifactId is the minimal non-greedy match, | ||
| // and the version captures everything after the first hyphen-digit boundary through .jar. | ||
| var jarFilenameRegex = cachedregexp.MustCompile(`^(.+?)-(\d.*)\.jar$`) |
There was a problem hiding this comment.
Avoid splitting artifact IDs at the first digit segment
For no-pom JARs whose artifactId itself contains a hyphen followed by a digit, such as log4j-1.2-api-2.17.1.jar, this non-greedy regex parses log4j as the artifact and 1.2-api-2.17.1 as the version. That filename-derived version is later preferred whenever manifest versions are missing or disagree, and the artifact value is also used for group/name inference, so the fallback can emit a non-existent Maven coordinate for a valid Maven filename. Consider deriving the filename split from the manifest version when available or otherwise testing version boundaries from the right instead of the first -\d boundary.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 75da718 by changing the regex from non-greedy (.+?) to greedy (.+). The greedy match finds the rightmost -\d boundary rather than the leftmost, so log4j-1.2-api-2.17.1.jar now splits as log4j-1.2-api + 2.17.1. Added a test case to cover this pattern.
db01517 to
75da718
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 75da7188fc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| case cleanedBundleName != "" && cleanedBundleName != bsn: | ||
| artifactID = cleanedBundleName |
There was a problem hiding this comment.
Preserve the filename artifactId for manifest fallback
For JARs whose manifest Bundle-Name is only a display/generic name, this branch emits the wrong Maven coordinate by replacing the filename artifactId. The new bcprov-jdk18on-1.78.1.jar fallback fixture itself demonstrates the problem: Maven Central identifies it as org.bouncycastle:bcprov-jdk18on:1.78.1, but this code returns org.bouncycastle:bcprov:1.78.1, so SBOM/PURL vulnerability matching will look up a different artifact. This applies whenever Bundle-Name is shorter or human-readable while the JAR filename carries the actual Maven artifactId.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 4f124c4. resolveManifestPackage now always uses filenameArtifact as the artifactId. Bundle-Name and Implementation-Title are OSGi/JAR display names — they're still passed to parseGroupID for groupId inference but are no longer candidates for the artifactId. The bcprov-jdk18on-1.78.1.jar fixture now produces org.bouncycastle:bcprov-jdk18on@1.78.1.
75da718 to
4f124c4
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f124c4eb6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // the split occurs at the LAST hyphen-digit boundary, not the first. This correctly | ||
| // handles artifactIds that themselves contain a hyphen-digit segment, e.g. | ||
| // log4j-1.2-api-2.17.1.jar → artifactId="log4j-1.2-api", version="2.17.1". | ||
| var jarFilenameRegex = cachedregexp.MustCompile(`^(.+)-(\d.*)\.jar$`) |
There was a problem hiding this comment.
Split snapshot versions before the timestamp suffix
For no-pom JARs whose filename version itself contains a hyphen followed by a digit, such as Maven's timestamped snapshots (my-lib-1.0-20260513.123456-1.jar) or other hyphenated numeric versions, this greedy match treats the last -\d as the artifact/version boundary. That yields filenameArtifact="my-lib-1.0-20260513.123456" and filenameVersion="1", so the manifest fallback can emit a non-existent coordinate even when the manifest has the correct package metadata. The split needs to account for numeric hyphens inside valid versions, ideally by using the manifest version when available or validating candidate splits from the right.
Useful? React with 👍 / 👎.
4f124c4 to
be88ee0
Compare
rjcoulter22
left a comment
There was a problem hiding this comment.
Mainly wondering about this https://github.com/DataDog/datadog-sbom-generator/pull/150/changes#r3237264109 - I might be missing something though so let me know 👍
| // Parsing stops at the first blank line, which ends the main section; per-entry | ||
| // sections that follow must not overwrite main-section values. | ||
| // Returns raw attribute values (not cleaned) so callers can apply appropriate cleaning. | ||
| func parseManifestAttributes(zipReader *zip.Reader) (bundleSymbolicName, bundleName, bundleVersion, implTitle, implVersion, automaticModuleName string, err error) { |
There was a problem hiding this comment.
nit: Could we return a struct here? It might make it a bit easier to unpack results/make any future modifcations
| return packages | ||
| } | ||
|
|
||
| bsn, bundleName, bundleVersion, _, implVersion, automaticModuleName, err := parseManifestAttributes(zipReader) |
There was a problem hiding this comment.
If we are not using that unpacked value can we just remove it from returning from this method? So something like
bsn, bundleName, bundleVersion, implVersion, automaticModuleName, err := parseManifestAttributes(zipReader)
| filenameVersion: "1.78.1", | ||
| bsn: "bcprov", | ||
| bundleName: "bcprov", | ||
| bundleVersion: "1..78.1", |
There was a problem hiding this comment.
Question: Version type? Should this be 1.78.1?
|
|
||
| // artifactId comes from the filename: it is the most reliable source of the Maven | ||
| // artifact ID. Bundle-Name / Implementation-Title are display names only. | ||
| artifactID := filenameArtifact |
There was a problem hiding this comment.
Here we unconditionally use the file name but the PR description says artifactId priority (Bundle-Name > Implementation-Title > filename), is the description just out of date?
There was a problem hiding this comment.
That was the original plan, but it was not resulting in the expected result for BC, so the file name took preference when the bundle name was not correct, let me update the description. I also sent the improvements to dd-trace-java so that they could implement them in the Java Tracer
Three improvements to the MANIFEST.MF fallback parser: 1. **Classifier stripping** (Codex P2): JAR filenames such as netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar were capturing the entire suffix (2.0.61.Final-linux-x86_64) as the version. jarClassifierRegex now strips known OS/arch classifiers (linux, windows, osx, …) and common descriptor classifiers (sources, javadoc, native, …) before returning the version. 2. **Automatic-Module-Name for groupId** (BouncyCastle): when Bundle-SymbolicName has no dots (e.g. "bcprov"), the BSN dot-prefix heuristic cannot infer a groupId. parseGroupID now falls back to Automatic-Module-Name: first applies the same dot-prefix heuristic, then as a last resort strips the final segment of AMN when it has ≥ 2 dots (e.g. "org.bouncycastle.provider" → "org.bouncycastle"). 3. **Filename artifact preference when Bundle-Name == BSN**: when Bundle-Name and BSN are identical after cleaning (both "bcprov"), Bundle-Name carries no additional information and is skipped so the more-specific filename-derived artifactId (e.g. "bcprov-jdk15to18") is used instead. Together these produce org.bouncycastle:bcprov-jdk15to18@1.78.1 from the real bcprov-jdk15to18-1.78.1.jar rather than bcprov:bcprov@1.78.1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
be88ee0 to
7ad4cc9
Compare

Summary
JarPomPropertiesExtractorfor JAR files that lackMETA-INF/maven/*/pom.properties(e.g. BouncyCastlebcprov-jdk15to18-1.78.1.jar)Dependency.javaguessFallbackNoPomalgorithm: groupId inference via dot-prefix heuristic onBundle-SymbolicName, version priority (agreement > filename >Bundle-Version>Implementation-Version)Bundle-NameandImplementation-Titleare OSGi/JAR display names and are not reliable Maven artifact IDs (e.g.Bundle-Name: bcprovvs filenamebcprov-jdk18on); the filename-derived artifact ID is always usednetty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jarnow yield version2.0.61.Finalinstead of including the classifier; known OS/arch (linux,windows,osx, …) and descriptor (sources,javadoc,native, …) classifiers are stripped; Maven qualifiers (-SNAPSHOT,-Final,-RC*) are left untouched-\dboundary solog4j-1.2-api-2.17.1.jarsplits asartifactId=log4j-1.2-api,version=2.17.1Automatic-Module-Namefallback for groupId: whenBundle-SymbolicNamehas no dots (e.g.bcprov), the BSN dot-prefix heuristic cannot infer a groupId;parseGroupIDfalls back toAutomatic-Module-Name, first applying the same dot-prefix heuristic, then stripping the last dot-segment when AMN has >= 2 dots (e.g.org.bouncycastle.provider->org.bouncycastle)parseManifestAttributesreturns amanifestAttrsstruct instead of multiple return valuesAlgorithm Details
groupId inference (
parseGroupID):Bundle-SymbolicName(everything after;)[filenameArtifact, cleanName(bundleName)]against BSN dot-prefix heuristicAutomatic-Module-Name, then strips AMN last segment (requires >= 2 dots)artifactId: always the filename-derived artifact ID;
Bundle-NameandImplementation-Titleare only used as secondary candidates for groupId inference, never as the artifact IDVersion priority:
Bundle-Version == Implementation-Version(high confidence) -> filename version ->Bundle-Version->Implementation-VersionFilename regex:
^(.+)-(\d.*)\.jar$(greedy, rightmost split) + classifier stripping viajarClassifierRegexReal-world result for
bcprov-jdk15to18-1.78.1.jar:bcprov:bcprov@1.78.1org.bouncycastle:bcprov-jdk15to18@1.78.1Test plan
parseJarFilename(incl. classifier cases),parseManifestAttributes(incl. AMN + per-entry isolation),parseGroupID(incl. AMN fallback),resolveManifestPackage(incl. artifactId always from filename)./scripts/run_tests.shpasses./scripts/run_lints.shpassesGenerated with Claude Code