Malformed HTML in multiple license files #1680
Comments
Thanks @DaveJarvis for the analysis. In looking at the XML for the first few, the This is explicitly allowed in the schema file (see license-list-XML/schema/ListedLicense.xsd Line 319 in a8f83ee
We can update the schema to disallow lists within lists, which will prevent future submissions of license XML files with this error. Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed, which is quite an effort. I'll create a draft PR with the updated schema, but I won't have time to fix all the licenses. If there are any volunteers willing to help out, you can create a PR against the PR with the schema update. Once all the XML files are fixed up, we can merge the schema update and the fixed licenses. |
Related to issue #1680 Signed-off-by: Gary O'Neall <[email protected]>
If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help. Perhaps @michaelhkay (https://github.com/michaelhkay) may have some recommendations? |
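For anyone without XSLT experience, the same bulk fix can be sketched in Python instead. This is only an illustration under assumptions, not the project's tooling: it assumes the license XML marks lists with `list` elements and list entries with `item` elements, and the function name `fix_nested_lists` is made up for this example. If the files declare a default namespace, the tag names would need to be passed in the `{namespace-uri}list` form that ElementTree uses.

```python
import xml.etree.ElementTree as ET


def fix_nested_lists(root, list_tag="list", item_tag="item"):
    """Move each list that is a direct child of a list into the preceding item.

    Returns the number of lists moved. The default tag names are assumptions.
    """
    moved = 0
    # Snapshot the list elements first, because the tree is modified while walking.
    for parent in list(root.iter(list_tag)):
        previous_item = None
        for child in list(parent):
            if child.tag == item_tag:
                previous_item = child
            elif child.tag == list_tag and previous_item is not None:
                # Re-parent the misplaced list under the item that precedes it.
                parent.remove(child)
                previous_item.append(child)
                moved += 1
    return moved


if __name__ == "__main__":
    doc = ET.fromstring(
        "<list><item>1</item><list><item>a</item></list></list>"
    )
    print(fix_nested_lists(doc), ET.tostring(doc, encoding="unicode"))
```

A list with no preceding item is left untouched, so those cases would still need a manual look. Whitespace around moved elements travels with them, so the output may need re-indenting.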
That sounds like an excellent idea. I'm not much of an XML expert, so help with the transform would be great. In fact, I could use some help getting the XSD to work - see the comment in PR #1681. |
With the help of @zvr I was able to create a schema update that catches this error. There are 45 licenses that will need to be fixed for them to pass the schema validation. This should also fix the malformed HTML next time the license list is published. Anyone able to help could create a pull request against the branch. I'll leave the PR in draft mode until we have all the licenses fixed and ready to go. |
so, will this mean that we can't have nested lists? |
@jlovejoy - no, you can still have nested lists. You just need to nest them as follows:
What we are disallowing is the following:
which generates illegal HTML. |
Lists can be nested. The following fails W3C validation due to improper nesting:
<ol>
  <li>1</li>
  <ol>
    <li>a</li>
    <ol>
      <li>i</li>
    </ol>
  </ol>
</ol>
Properly nested to pass W3C validation:
<ol>
  <li>1
    <ol>
      <li>a
        <ol>
          <li>i</li> <!-- close the 'i' list item -->
        </ol>
      </li> <!-- close the 'a' list item -->
    </ol>
  </li> <!-- close the '1' list item -->
</ol>
Most browsers will probably render them fine in practice, but strictly speaking the former does not conform to the standard. |
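The difference between the two snippets can also be checked mechanically. Below is a minimal sketch using Python's standard `html.parser`; it is not the W3C validator and only looks for this one pattern, an `<ol>` or `<ul>` whose direct parent is another `<ol>` or `<ul>`. The names `NestedListChecker` and `find_badly_nested_lists` are made up for this example.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ol", "ul"}
# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NestedListChecker(HTMLParser):
    """Record every <ol>/<ul> whose direct parent is another <ol>/<ul>."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.problems = []  # (line, column, tag) for each badly nested list

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS and self.stack and self.stack[-1] in LIST_TAGS:
            line, column = self.getpos()
            self.problems.append((line, column, tag))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating
        # elements that were never explicitly closed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_badly_nested_lists(html):
    checker = NestedListChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(find_badly_nested_lists("<ol><li>1</li><ol><li>a</li></ol></ol>"))
```

Running it over each generated HTML file would give a quick list of candidates to fix, though a full validator will catch many problems this check ignores.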
Is this solved now with #1681? With the schema change there and the licenses where I added I didn't change the following ones from the list above:
|
@DaveJarvis Would you mind running the script again with the latest license-list-data? I merged #1681 which should resolve most of the errors, but from the list above, it looks like there still may be other errors - perhaps from a different cause. |
Full log: results.txt.gz
As well as:
|
Thanks @DaveJarvis! Looking briefly through the errors:
|
Making them valid XML allows for XSLT preprocessing. |
Thanks @DaveJarvis and @swinslow for the additional analysis.
Yes - these are snippets to be included inside another "well formed" HTML page, so I think we can ignore these.
I found this Stack Overflow article which suggests the paragraph tags should be closed before the list elements. We should probably change the XML and update the XML schema to make future occurrences an error. I added issue #1758 to update the schema. Since it is a bit involved, I won't be tackling this issue within the next week or so, but I'll try to get back to it after the holidays. |
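To illustrate why the paragraph has to be closed first: in HTML, an `<ol>` or `<ul>` start tag implicitly closes any open `<p>`, so a `</p>` written after the list no longer matches anything. The sketch below flags such stray `</p>` tags with Python's standard `html.parser`. It is a rough approximation that only considers list start tags (other block-level elements close a paragraph too), and the names `StrayParagraphEndChecker` and `find_stray_paragraph_ends` are made up for this example.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ol", "ul"}


class StrayParagraphEndChecker(HTMLParser):
    """Flag </p> tags that follow a list which already closed the paragraph."""

    def __init__(self):
        super().__init__()
        self.p_open = False
        self.problems = []  # (line, column) of each stray </p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_open = True
        elif tag in LIST_TAGS:
            # A list start tag implicitly closes an open paragraph.
            self.p_open = False

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                self.problems.append(self.getpos())
            self.p_open = False


def find_stray_paragraph_ends(html):
    checker = StrayParagraphEndChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(find_stray_paragraph_ends("<p>intro<ul><li>x</li></ul></p>"))
```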
Ping me when the main branch is updated. |
where are we at with this? |
FYI, in two days I will be locked out of my account once GitHub begins enforcing MFA. Sorry I won't be able to help any further. |
A similar problem to the one raised in issue #1673 exists in CC-BY-2.5.html: nesting a <ul> directly within a <ul> is invalid (the nested <ul> must be inside an <li> element): source
I suspect that these types of errors are systemic, given that the same issue was encountered twice. Consider updating your processes to run the HTML files through the W3C's validator service so that all these types of errors can be corrected at one time, rather than waiting on individual issues being raised on a per-license basis.
From Linux, this can be accomplished as follows:
See attached:
results.txt.gz
Further filter the file using something like:
That gives a good starting place for addressing most of the non-validating HTML files.