Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed HTML in multiple license files #1680

Open
ghost opened this issue Oct 15, 2022 · 20 comments
Open

Malformed HTML in multiple license files #1680

ghost opened this issue Oct 15, 2022 · 20 comments

Comments

@ghost
Copy link

ghost commented Oct 15, 2022

A similar problem raised in issue #1673 exists with CC-BY-2.5.html in that the nesting of <ul> within <ul> is invalid (the nested <ul> must be inside an <li> element):

source

I suspected that these types of errors are systemic, given that the same issue was encountered twice. Consider updating your processes to run the HTML through the W3C's validator service to check the HTML files so that all these types of errors can be corrected at one time, rather than waiting on individual issues being raised on a per-license basis.

From Linux, this can be accomplished as follows:

cd /tmp
git clone https://github.com/spdx/license-list-data/
cd license*/html
for i in *.html; do
  echo "--------------------------";
  echo "$i";
  echo "--------------------------";
  curl -H "Content-Type: text/html; charset=utf-8" --data-binary @$i "https://validator.w3.org/nu/?out=gnu" | grep -v "1.1-2.*";
  echo "";
done > results.txt

See attached:

results.txt.gz

Further filter the file using something like:

$ grep -v slash results.txt | grep -v 1.1-3 | grep -B2 "error:" | grep html$
AAL.html
Aladdin.html
APL-1.0.html
APSL-1.0.html
APSL-1.1.html
APSL-1.2.html
Artistic-2.0.html
Bitstream-Vera.html
BSD-3-Clause-Clear.html
BSD-Protection.html
Caldera.html
CC-BY-2.5-AU.html
CC-BY-2.5.html
CC-BY-NC-SA-1.0.html
CC-BY-SA-2.0.html
CDL-1.0.html
CECILL-1.1.html
LGPL-2.0.html
LGPL-2.0+.html
LGPL-2.0-only.html
LGPL-2.0-or-later.html
libpng-2.0.html
Motosoto.html
NBPL-1.0.html
OLDAP-1.1.html
OLDAP-1.2.html
OLDAP-1.3.html
OLDAP-1.4.html
OPUBL-1.0.html

That gives a good starting place for addressing most of the non-validating HTML files.

@goneall
Copy link
Member

goneall commented Oct 15, 2022

Thanks @DaveJarvis for the analysis.

In looking at the XML for the first few, the <list> elements contain other <list> elements which generates the erroneous HTML.

This is explicitly allowed in the schema file (see

<element name="list" type="tns:listType"/>
).

We can update the schema to disallow lists within lists which will prevent future submittals of license XML's with this error.

Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.

I'll create a draft PR with the updated schema, but I won't have time to fix all the licenses. If there are any volunteers willing to help out, you can create a PR against the PR with the schema update. Once all the XML's are fixed up, we can merge the schema update and the fixed licenses.

@ghost
Copy link
Author

ghost commented Oct 15, 2022

Unfortunately, it won't pass the CI until all the above licenses with this error are manually fixed - which is quite an effort.

If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.

Perhaps @michaelhkay (https://github.com/michaelhkay) may have some recommendations?

@goneall
Copy link
Member

goneall commented Oct 16, 2022

If it's all XML, there should be an XSL transformation that can be applied to fix them en masse. If you're not familiar with XSL, a StackOverflow guru might be willing to help.

That sounds like an excellent idea. I'm not much of an XML expert, so help with the transform would be great. I fact I could use some help getting the XSD to work - see the comment in PR #1681.

@goneall
Copy link
Member

goneall commented Oct 21, 2022

With the help of @zvr I was able to create a schema update that catches this error.

There are 45 licenses that will need to be fixed for them to pass the schema validation. This should also fix the malformed HTML next time the license list is published.

Anyone able to help could create a pull request against the branch listschema which can then be merged into PR #1681 before merging into main.

I'll leave the PR in draft mode until we have all the licenses fixed and ready to go.

@jlovejoy
Copy link
Member

so, will this mean that we can't have nested lists?

@goneall
Copy link
Member

goneall commented Oct 27, 2022

so, will this mean that we can't have nested lists?

@jlovejoy - no, you can still have nested lists. You just need to nest them as follows:

<list>
   <listItem>
      <list> ...

What we are disallowing is the following:

<list>
  <list>

which generates illegal HTML.

@ghost
Copy link
Author

ghost commented Oct 27, 2022

so, will this mean that we can't have nested lists?

Lists can be nested. The following fails W3C validation due to improper nesting (an ol or ul cannot be a direct child of an ol or ul element):

<ol>
  <li>1</li>
  <ol>
    <li>a</li>
    <ol>
      <li>i</li>
    </ol>
  </ol>
</ol>

Properly nested to pass W3C validation:

<ol>
  <li>1
  <ol>
    <li>a
    <ol>
      <li>i</li> <!-- close the 'i' list item -->
    </ol>
    </li> <!-- close the 'a' list item -->
  </ol>
  </li> <!-- close the '1' list item -->
</ol>

Most browsers will probably render them fine in practice, but strictly speaking the former does not conform to the standard.

@swinslow
Copy link
Member

Is this solved now with #1681? With the schema change there and the licenses where I added <item> tags, I think that fixes most of the licenses that @DaveJarvis listed, but maybe not all of them.

I didn't change the following ones from the list above:

  • APL-1.0
  • Bitstream-Vera
  • Caldera
  • CDL-1.0
  • CECILL-1.1
  • libpng-2.0
  • OPUBL-1.0

@goneall
Copy link
Member

goneall commented Dec 24, 2022

@DaveJarvis Would you mind running the script again with the latest license-list-data? I merged #1681 which should resolve most of the errors, but from the list above, it looks like there still may be other errors - perhaps from a different cause.

@ghost
Copy link
Author

ghost commented Dec 24, 2022

Full log: results.txt.gz

$ grep -v slash results.txt | grep -v 1.1-3 | grep -B2 "error:" | grep html$
APL-1.0.html
Bitstream-Vera.html
Caldera.html
CDL-1.0.html
CECILL-1.1.html
libpng-2.0.html
OPUBL-1.0.html

As well as:

$ grep -B2 "error:" results.txt 
:1015.52-1015.57: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1017.59-1017.64: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:1142.16-1142.19: error: No “p” element in scope but a “p” end tag seen.
--
App-s2p.html
--------------------------
:1.1-3.9: error: Non-space characters found without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Charter.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
Bitstream-Vera.html
--------------------------
:1.1-1.3: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-1.3: error: Element “head” is missing a required instance of child element “title”.
--
Caldera.html
--------------------------
:41.7-41.10: error: No “p” element in scope but a “p” end tag seen.
--
CDL-1.0.html
--------------------------
:39.6-39.9: error: No “p” element in scope but a “p” end tag seen.
:65.6-65.9: error: No “p” element in scope but a “p” end tag seen.
--
--------------------------
:8.9-8.14: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
:29.10-29.13: error: No “p” element in scope but a “p” end tag seen.
:57.9-57.12: error: No “p” element in scope but a “p” end tag seen.
--
checkmk.html
--------------------------
:1.1-3.44: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.44: error: Element “head” is missing a required instance of child element “title”.
--
CNRI-Jython.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
diffmark.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
etalab-2.0.html
--------------------------
:1.1-3.41: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.41: error: Element “head” is missing a required instance of child element “title”.
--
libpng-2.0.html
--------------------------
:56.7-56.10: error: No “p” element in scope but a “p” end tag seen.
--
OPUBL-1.0.html
--------------------------
:63.6-63.9: error: No “p” element in scope but a “p” end tag seen.
:81.6-81.9: error: No “p” element in scope but a “p” end tag seen.
--
Ruby.html
--------------------------
:1.1-3.28: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.28: error: Element “head” is missing a required instance of child element “title”.
--
SchemeReport.html
--------------------------
:1.1-3.9: error: Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
:1.1-3.9: error: Element “head” is missing a required instance of child element “title”.

@swinslow
Copy link
Member

Thanks @DaveJarvis! Looking briefly through the errors:

  • Several of these appear to be relating to the files not having <!DOCTYPE html> and <head> elements at the start. @goneall I see from the main license-list-data README that the html versions are just meant to be simple HTML snippets, not fully valid HTML files, so should these be disregarded?
  • The others (e.g. Caldera and APL-1.0) appear to be caused by a <list> element sitting in the middle of <p></p> tags. Would this be valid, or should this be changed in these XML files to close the starting <p> tag before the <list> starts?

@ghost
Copy link
Author

ghost commented Dec 24, 2022

should this be changed in these XML files

Making them valid XML allows for XSLT preprocessing.

@goneall
Copy link
Member

goneall commented Dec 25, 2022

Thanks @DaveJarvis and @swinslow for the additional analysis.

I see from the main license-list-data README that the html versions are just meant to be simple HTML snippets, not fully valid HTML files, so should these be disregarded?

Yes - these are snippets to be included inside another "well formed" HTML page, so I think we can ignore these.

Would this be valid, or should this be changed in these XML files to close the starting

tag before the starts?

I found this Stack Overflow article which suggests the paragraph tags should be close before the list elements.

We should probably change the XML and update the XML schema to make future occurrences an error. I added issue #1758 to update the schema. Since it is a bit involved, I won't be tackling this issue within the next week or so, but I'll try to get back to it after the holidays.

@swinslow
Copy link
Member

Thank you both.

Even though #1758 is now added to track the specific "lists in paragraphs" issue, for the time being I'll leave this one open as well. After #1758 / #1762 are resolved, @DaveJarvis perhaps you can then re-run your checks to see if any issues are still showing up. Thanks!

@ghost
Copy link
Author

ghost commented Dec 27, 2022

Ping me when the main branch is updated.

@jlovejoy
Copy link
Member

where are we at with this?

@swinslow
Copy link
Member

This is related to #1758 / #1762. I need to refresh my recollection on where these are at following the PR that @goneall and I worked on in Oct. / Dec. in #1681. That said, the related issues are moved to 3.21 so I'm moving this as well.

@swinslow swinslow modified the milestones: 3.20, 3.21 Feb 15, 2023
@swinslow swinslow modified the milestones: 3.21, 3.22 Jun 18, 2023
@jlovejoy
Copy link
Member

jlovejoy commented Oct 1, 2023

@goneall @swinslow - is this still an open issue?

@ghost
Copy link
Author

ghost commented Oct 5, 2023

FYI, in two days I will be locked out of my account once GitHub begins enforcing MFA. Sorry I won't be able to help any further.

@swinslow swinslow removed this from the 3.22 milestone Oct 5, 2023
@jlovejoy
Copy link
Member

jlovejoy commented May 9, 2024

@swinslow @goneall - I'm moving this to "later release" but wuold be good to get to closure!

@jlovejoy jlovejoy modified the milestones: 3.24, Later Release May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants