Write higher level acceptance tests #734
After discussion with @barbeau, we think that executing the validator on sample GTFS datasets and analyzing the generated report (via parsing) could be a way to write such tests. Any thoughts/advice, @aababilov?
From the related Issue public-transport/ideas#17, which in turn quotes georust/transitfeed#5 (comment):
I see this being useful in the context of issues like #712 (comment) as well, where we want to ensure that we aren't generating errors on valid datasets (or at least know when we are, if the spec needs to be updated to reflect practice).
GitHub workflows could be leveraged for those integration tests. Maybe a shortcut could be a binary diff? That would work if the validator's output is deterministic.
I agree - although it's not yet clear to me by which mechanism the results should be evaluated. Is it a separate module that executes the validator, then parses and checks the JSON output? Or does it directly instantiate the validator and check the data structure of notices without producing JSON? Parsing the JSON output would certainly be the most "end-to-end", as it would also confirm the JSON output isn't broken.
If you mean a binary check of the JSON output (or of the data structure, if it's somehow done internally), yes, I think that could work as a preliminary shortcut after an initial check is run. Basically, if we already checked the JSON output previously and confirmed it was correct, and the new JSON output hash matches the old one, then nothing changed and it's fine. That said, we're likely to add INFO notices at some point, which would change the file hash and then require parsing the contents and checking the number of errors, warnings, etc. We'd also need to figure out where to store the "last valid hash" for each feed output.
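A minimal sketch of the "parse the JSON output" option, assuming the report is a JSON file with a `notices` array whose entries carry a `code` field (the file layout and field names here are illustrative assumptions, not necessarily the validator's actual report schema), using Gson:

```java
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: tallies notices per code from a validator JSON report. */
public class ReportSummary {

  /** Returns a map of notice code -> number of notice entries with that code. */
  public static Map<String, Integer> countNoticesByCode(Path reportJson) throws Exception {
    String json = Files.readString(reportJson);
    JsonObject root = JsonParser.parseString(json).getAsJsonObject();
    // Assumed (illustrative) schema: {"notices": [{"code": "...", "severity": "ERROR", ...}, ...]}
    JsonArray notices = root.getAsJsonArray("notices");
    Map<String, Integer> counts = new TreeMap<>();
    for (JsonElement element : notices) {
      JsonObject notice = element.getAsJsonObject();
      counts.merge(notice.get("code").getAsString(), 1, Integer::sum);
    }
    return counts;
  }
}
```

A test module could then assert on the resulting counts (e.g. zero errors for a known-good dataset) instead of comparing raw files.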
IMHO, since those are integration tests, it would make sense for them to go all the way and check the JSON output itself. In terms of GitHub workflows, it could either be a new step in the end-to-end workflow or another workflow that grabs the previously produced artifacts.
In my mind that was the JSON file itself, diffed against the output of a previous run that is deemed valid for the input data. The reference reports (or their hashes) will have to live somewhere, but that could also be handled as workflow artifacts, I think?
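A sketch of that hash comparison, assuming the reference hash is stored as a small text file (for example as a workflow artifact); note this only works if the report bytes are reproducible, which, as the later comments point out, would require canonicalizing the notice order first:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

/** Hypothetical hash shortcut: compare the report against a previously validated reference hash. */
public class ReportHashCheck {

  public static boolean matchesReference(Path reportJson, Path referenceHashFile) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(reportJson));
    StringBuilder actual = new StringBuilder();
    for (byte b : digest) {
      actual.append(String.format("%02x", b));
    }
    String expected = Files.readString(referenceHashFile).trim();
    // Matching hashes mean the report is byte-identical to the last known-good output.
    return actual.toString().equalsIgnoreCase(expected);
  }
}
```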
I think so - although, thinking about it more, these JSON files should be pretty small and fast to parse, so I'm not sure the complexity of the hash solution would be worth the performance gain. But we can see.
As discussed in #742, when a feed fails to parse, the validator currently still finishes with Java exit code 0 and very little output. #742 added text to the system output to flag this condition for humans reading the output. We should also handle this case for these types of tests - we can parse the JSON output and intentionally fail the CI build on a parse failure so we're automatically notified.
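A sketch of what such a CI guard could look like, reusing the hypothetical `countNoticesByCode` helper from the earlier comment; the report file name is a placeholder:

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical CI guard: turn "the feed failed to parse" into a failing build. */
public class FailOnUnusableReport {

  public static void main(String[] args) {
    Path report = Path.of(args.length > 0 ? args[0] : "report.json");
    if (!Files.exists(report)) {
      System.err.println("No validation report found - the feed likely failed to parse.");
      System.exit(1); // non-zero exit fails the CI job even though the validator exited with 0
    }
    try {
      ReportSummary.countNoticesByCode(report); // parsing sketch from the earlier comment
    } catch (Exception e) {
      System.err.println("Could not parse the validation report: " + e.getMessage());
      System.exit(1);
    }
  }
}
```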
Got my answer: it is NOT deterministic, so I guess an automated process could reason about the number of notices of each type. Is there any specially crafted verybaddataset/verygooddataset out there that would be a good first case for such a CI/CD process?
Correct, order of notices isn't deterministic, but number of each type of notice is. |
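Since only the per-type counts are stable, a comparison would go through those tallies rather than the raw bytes; a sketch, again building on the hypothetical `countNoticesByCode` helper above:

```java
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical order-independent comparison of two reports by notice counts per code. */
public class ReportDiff {

  /** Returns true if both reports contain the same number of notices for every notice code. */
  public static boolean sameNoticeCounts(Path referenceReport, Path candidateReport) throws Exception {
    Map<String, Integer> reference = ReportSummary.countNoticesByCode(referenceReport);
    Map<String, Integer> candidate = ReportSummary.countNoticesByCode(candidateReport);
    Set<String> allCodes = new HashSet<>(reference.keySet());
    allCodes.addAll(candidate.keySet());
    for (String code : allCodes) {
      if (!reference.getOrDefault(code, 0).equals(candidate.getOrDefault(code, 0))) {
        return false;
      }
    }
    return true;
  }
}
```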
Per discussion with @barbeau and @carlfredl, this process will be automated. We are considering implementing a new module for testing the validator against various datasets (ultimately all the datasets available via the Mobility Archives) to make sure that the implementation of a new rule does not wrongfully invalidate a large portion of existing datasets. Before merging to the master branch, each snapshot (intermediate) version of the validator that includes modifications to existing rules and/or new rule implementations should be executed on a variety of selected GTFS Schedule datasets. We are considering using fixed versions of datasets in this process in order to obtain consistent results. After parsing the validation report, a dataset will be considered invalid if the report contains more than a certain number of errors, and the percentage of newly invalid datasets will be used to determine whether a new implementation is acceptable or not (the numbers are yet to be determined: that is a subject where we would be glad to have community feedback @aababilov 😉). Our vision here comes from what we've heard from consumers: they need to trust that changes to the open source project will be incremental, e.g. a change that makes half of the GTFS datasets already in use around the world invalid overnight is too extreme to be usable.
Next steps
An update on this issue. The principle is the following: run two versions of the validator (the latest release and the snapshot version) on each GTFS dataset in the Mobility Archives and compare the number of errors per validation report (for each dataset). Finally, an addition or change to the code will be flagged if it makes more than 1% of datasets invalid (a rough sketch of this flagging rule is below). Needed:
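Separate from the items needed above, a rough sketch of that flagging rule; the per-dataset error counts are assumed to come from parsing each pair of reports, and the threshold names and values are placeholders, not decided numbers:

```java
import java.util.Map;

/** Hypothetical acceptance rule; both thresholds are the open parameters mentioned above. */
public class AcceptanceCheck {

  /** A dataset is treated as invalid when its report contains more than maxErrors errors. */
  static boolean isInvalid(int errorCount, int maxErrors) {
    return errorCount > maxErrors;
  }

  /**
   * Accepts the snapshot only if the share of datasets that are invalid under the snapshot
   * but valid under the latest release stays at or below maxNewlyInvalidRatio (e.g. 0.01 for 1%).
   */
  static boolean isAcceptable(
      Map<String, Integer> releaseErrors,  // dataset id -> error count, latest release
      Map<String, Integer> snapshotErrors, // dataset id -> error count, snapshot under test
      int maxErrors,
      double maxNewlyInvalidRatio) {
    int newlyInvalid = 0;
    for (Map.Entry<String, Integer> entry : releaseErrors.entrySet()) {
      int snapshotCount = snapshotErrors.getOrDefault(entry.getKey(), 0);
      if (!isInvalid(entry.getValue(), maxErrors) && isInvalid(snapshotCount, maxErrors)) {
        newlyInvalid++;
      }
    }
    return (double) newlyInvalid / releaseErrors.size() <= maxNewlyInvalidRatio;
  }
}
```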
At present, most of the implemented tests are unit tests. These make sure that the individual units/components are tested in isolation.
Higher-level tests should be written to verify the execution of multiple modules together.