Write higher level acceptance tests #734

Closed
lionel-nj opened this issue Feb 9, 2021 · 12 comments · Fixed by #848, #926, #1037 or #1038

@lionel-nj
Contributor

lionel-nj commented Feb 9, 2021

At present, most implemented tests are unit tests. These make sure that the individual units/components are tested in isolation.

Higher-level tests should be written to verify the execution of multiple modules together.

@lionel-nj
Contributor Author

lionel-nj commented Feb 9, 2021

After discussion with @barbeau, we think that executing the validator on sample GTFS datasets and analyzing the generated report (via parsing) could be a way to write such tests.
We could start by crafting GTFS datasets that lead to different system errors, and then extend the same process to validation errors.
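A rough sketch of what one of these tests could look like, assuming the validator is invoked as a CLI jar and writes a report.json with a top-level "notices" array carrying a "code" and "severity" per entry. The jar name, flags, paths, and report fields below are illustrative assumptions, not the actual interface:

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.file.Files;
import java.nio.file.Path;

public class AcceptanceTestSketch {

  public static void main(String[] args) throws Exception {
    // Hypothetical CLI invocation: jar name, flags, and paths are placeholders.
    Process validator = new ProcessBuilder(
            "java", "-jar", "gtfs-validator.jar",
            "--input", "src/test/resources/sample-feed.zip",
            "--output_base", "build/acceptance")
        .inheritIO()
        .start();
    int exitCode = validator.waitFor();
    System.out.println("validator exit code: " + exitCode);

    // Parse the generated report and count ERROR-severity notices.
    String json = Files.readString(Path.of("build/acceptance/report.json"));
    JsonObject report = JsonParser.parseString(json).getAsJsonObject();
    long errors = 0;
    for (JsonElement e : report.getAsJsonArray("notices")) {
      JsonObject notice = e.getAsJsonObject();
      if ("ERROR".equals(notice.get("severity").getAsString())) {
        errors++;
      }
    }
    // A crafted "bad" dataset would assert errors > 0; a known-good one would assert errors == 0.
    System.out.println("error notices: " + errors);
  }
}
```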

Any thoughts/advice, @aababilov?

@lionel-nj lionel-nj added this to the v2.0 milestone Feb 9, 2021
@derhuerst

From the related Issue public-transport/ideas#17, which in turn quotes georust/transitfeed#5 (comment):

reusability

Answering quite basic (but very relevant in practical usage of GTFS) questions like When does any vehicle depart at a bus stop? is surprisingly a lot of work: GTFS Time values are inherently timezone-dependent, frequencies.txt with exact_times=1 defines "stop times" as well, etc. With the ever-growing number of optional parts and extensions, doing GTFS processing right is a lot of work, so we should make the implementation in this project as reusable/flexible as possible.
Also, a project- and language-independent test suite, i.e. a set of fixtures per "question"/operation, would be very helpful for this. Those have been very successful in other areas, e.g. for WebSocket implementations or for HTML5 parsers.

validation

There are a bazillion validation (i.e. "semantic checks on the actual data") cases; the best practices page is long, and the GTFS issue tracker and mailing lists are full of edge cases. There are at least 20 libs across languages doing some form of validation, but none of them covers all the issues that we see with GTFS feeds in the wild.
I'd dare to say that people don't care which language a GTFS validator is written in, but they strongly prefer a certain language for "questions"/analysis. Like the "questions"/analysis mentioned above, validation lends itself to a project- and language-independent set of fixtures, maintained by the wider GTFS community. I hope this will push the overall quality of GTFS feeds, and reduce the amount of duplicated work poured into all those GTFS validation libs. I therefore propose not to put too much effort into validation in this project (I'm obviously just a random stranger telling what to do 😬).

@barbeau
Member

barbeau commented Feb 9, 2021

I see this being useful in the context of issues like #712 (comment) as well, where we want to ensure that we aren't generating errors on valid datasets (or at least know when we are, if the spec needs to be updated to reflect practice).

@nackko
Contributor

nackko commented Feb 11, 2021

GitHub workflows could be leveraged for those integration tests. Maybe a shortcut could be a binary diff? That could work if the validator is deterministic.

@barbeau
Member

barbeau commented Feb 11, 2021

GitHub workflows could be leveraged for those integration tests.

I agree - although the mechanism by which the results should be evaluated isn't clear to me yet. Is it a separate module that executes the validator, then parses and checks the JSON output? Or does it directly instantiate the validator and then check the data structure of notices without outputting JSON?

Parsing the JSON output would certainly be the most "end-to-end" as it would confirm JSON output also isn't broken.

Maybe a shortcut could consist in a binary diff?

If you mean a binary check of the JSON output (or data structure, if it's somehow done internally), yes, I think that could work as a preliminary shortcut after the initial check is run. Basically, if we already checked the JSON output previously and confirmed it was correct, and the new JSON output hash matches the old, then nothing changed and it's fine. Although it's likely that we could add INFO notices at some point that would change the file hash, which would then require parsing the contents and checking the number of errors, warnings, etc. And we'd need to figure out where to store the "last valid hash" for each feed output.
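A rough sketch of that hash shortcut, with placeholder file locations (where the reference hash actually lives is still the open question above):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class ReportHashCheck {

  // Compute a SHA-256 hex digest of the report so it can be compared to a stored reference.
  static String sha256Hex(Path file) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // Placeholder paths for the freshly generated report and the stored reference hash.
    String current = sha256Hex(Path.of("build/acceptance/report.json"));
    String reference = Files.readString(Path.of("src/test/resources/report.sha256")).trim();
    if (current.equals(reference)) {
      System.out.println("Report unchanged; skipping full comparison.");
    } else {
      System.out.println("Report changed; fall back to parsing and comparing notice counts.");
    }
  }
}
```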

@nackko
Contributor

nackko commented Feb 11, 2021

I agree - although the mechanism by which the results should be evaluated isn't clear to me yet. Is it a separate module that executes the validator, then parses and checks the JSON output? Or does it directly instantiate the validator and then check the data structure of notices without outputting JSON?

IMHO, since these are integration tests, it would make sense for them to go all the way and check the JSON output itself. In terms of GH workflows, this could either be a new step in the end-to-end workflow or another workflow that grabs the previously produced artifacts.

If you mean binary check the JSON output (or data structure, if it's somehow done internally)

In my mind that was the JSON file itself, diffed against the output of a previous run that is deemed valid for the input data.
I didn't think of hashing, but it's an obvious thing to do. Since the code is multithreaded, I'm not sure it is deterministic, though (@aababilov?). You are right regarding the INFO notices, though would that happen frequently?
Thinking about it, maybe such fields should be trimmed, as their content can be dataset-independent (like a timestamp or execution-time output).
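A sketch of that trimming step, assuming the run-specific values sit in top-level JSON fields (the field names here are placeholders, not the real report schema):

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportNormalizer {

  // Strip run-specific fields so two reports for the same dataset can be diffed or hashed.
  static String normalize(Path reportFile) throws Exception {
    JsonObject report =
        JsonParser.parseString(Files.readString(reportFile)).getAsJsonObject();
    // Placeholder field names; whatever timestamp/duration fields the real report
    // carries would be removed here instead.
    report.remove("timestamp");
    report.remove("executionTimeMs");
    return report.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(normalize(Path.of("build/acceptance/report.json")));
  }
}
```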

The reference reports, or their hashes, will have to be stored somewhere, but that could also potentially be handled as workflow artifacts, I think?

@barbeau
Member

barbeau commented Feb 11, 2021

The reference reports will have to be somewhere, or their hash, but that could also potentially be handled as workflow artifacts, I think?

I think so - although thinking about it more, these JSON files should be pretty small and fast to parse, so I'm not sure the complexity of the hash solution would be worth the performance gain. But we can see.

@barbeau
Member

barbeau commented Feb 11, 2021

As discussed in #742, when feeds fail to parse, the validator currently still finishes with Java exit code 0 and very little output. #742 added text to the system output to flag this condition for humans observing the output.

We should also handle this case in these types of tests - we can parse the JSON output and then intentionally fail the CI build on parse failures so we're automatically notified.
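A sketch of what that CI check could look like, assuming the parse failure shows up in the JSON output as an ERROR-severity notice (the file name and report fields are assumptions); the non-zero exit code is what actually fails the workflow:

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.file.Files;
import java.nio.file.Path;

public class FailBuildOnParseError {

  public static void main(String[] args) throws Exception {
    // Assumed output file name and layout: a "notices" array where parse
    // failures appear with ERROR severity.
    JsonObject report = JsonParser
        .parseString(Files.readString(Path.of("build/acceptance/system_errors.json")))
        .getAsJsonObject();
    for (JsonElement e : report.getAsJsonArray("notices")) {
      JsonObject notice = e.getAsJsonObject();
      if ("ERROR".equals(notice.get("severity").getAsString())) {
        System.err.println("Feed failed to parse: " + notice.get("code").getAsString());
        System.exit(1); // non-zero exit code makes the CI job fail
      }
    }
    System.out.println("No parse failures detected.");
  }
}
```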

@barbeau barbeau modified the milestones: v2.0, v2.1 Feb 15, 2021
@lionel-nj lionel-nj self-assigned this Feb 18, 2021
@barbeau barbeau changed the title Write higher level tests Write higher level integration tests Mar 1, 2021
@nackko
Contributor

nackko commented Mar 3, 2021

Being multithreaded, I'm not sure the code is deterministic though (@aababilov ?).

Got my answer. It is NOT deterministic, so I guess an automated process could reason about the number of notices of each type. Is there a specially crafted verybaddataset/verygooddataset out there that would be a good first case for such a CI/CD process?

@barbeau
Member

barbeau commented Mar 3, 2021

It is NOT deterministic so I guess an automated process could reason on the number of each type of Notices.

Correct, the order of notices isn't deterministic, but the number of each type of notice is.
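So instead of diffing raw files, the comparison could be done on a per-code count map, which is order-independent. A sketch, again assuming a "notices" array with a "code" field per entry (paths and fields are placeholders):

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class NoticeCountComparison {

  // Aggregate the report into {notice code -> count}; this is stable even if the
  // validator emits notices in a different order from run to run.
  static Map<String, Long> countsByCode(Path reportFile) throws Exception {
    Map<String, Long> counts = new TreeMap<>();
    for (JsonElement e :
        JsonParser.parseString(Files.readString(reportFile))
            .getAsJsonObject()
            .getAsJsonArray("notices")) {
      String code = e.getAsJsonObject().get("code").getAsString();
      counts.merge(code, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) throws Exception {
    Map<String, Long> reference = countsByCode(Path.of("reference/report.json"));
    Map<String, Long> current = countsByCode(Path.of("build/acceptance/report.json"));
    System.out.println(reference.equals(current) ? "Counts match." : "Counts differ: " + current);
  }
}
```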

@lionel-nj
Contributor Author

Per discussion with @barbeau and @carlfredl, this process will be automated. We are considering implementing a new module for testing the validator against various datasets (ultimately, all the datasets available via the Mobility Archives) to make sure that the implementation of a new rule does not make a large portion of existing datasets wrongfully invalid.

Before merging to the master branch, each snapshot (intermediate) version of the validator that includes modifications to existing rules and/or new rule implementations should be executed on the selected variety of GTFS schedule datasets.

We are considering using fixed versions of datasets in this process in order to obtain consistent results. After parsing the validation report, a dataset will be considered invalid if said report contains more than a certain number of errors, and the percentage of newly invalid datasets will be used to determine whether a new implementation is acceptable or not (numbers yet to be determined: that is a subject where we would be glad to have community feedback @aababilov 😉).

Our vision here comes from what we've heard from consumers: they need to trust that changes to the open source project will be incremental, e.g. a change which overnight makes half of the GTFS datasets already in use around the world invalid is too extreme to be usable.

Next steps
Begin implementation using a small number of datasets, get community feedback, and then extend the process to test the validator on a broader variety of GTFS schedule datasets leveraging the Mobility Archives project (@maximearmstrong ).

@carlfredl carlfredl changed the title Write higher level integration tests Write higher level integration / acceptance tests Apr 7, 2021
@lionel-nj lionel-nj mentioned this issue Apr 9, 2021
4 tasks
@lionel-nj lionel-nj changed the title Write higher level integration / acceptance tests Write higher level acceptance tests Jun 11, 2021
@isabelle-dr
Contributor

An update on this issue:
MobilityData is working on it in PR #933 and PR #848

The principle is the following: run two versions of the validator (the latest release and the snapshot version) on each GTFS dataset in the MobilityArchives and compare the number of errors per validation report (for each dataset). Finally, an addition or a change of code will be flagged if it makes more than 1% of the datasets invalid.
These tests rely on Kubernetes and GitHub Actions and are triggered on each push to a given pull request.
See the following diagram:
https://user-images.githubusercontent.com/35747326/130859517-56490d7a-de7d-4d91-b73a-f7bf209337aa.png
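One possible reading of that acceptance rule, sketched in code (how the per-dataset error counts are collected is left out, and "newly invalid" is interpreted here as: the snapshot reports errors where the release reported none):

```java
import java.util.Map;

public class AcceptanceRule {

  // A dataset is "newly invalid" if the snapshot reports errors where the latest
  // release reported none. The change is flagged if more than the allowed ratio
  // (1% per the description above) of datasets become newly invalid.
  static boolean isAcceptable(Map<String, Integer> releaseErrors,
                              Map<String, Integer> snapshotErrors,
                              double maxNewlyInvalidRatio) {
    long newlyInvalid = releaseErrors.keySet().stream()
        .filter(dataset -> releaseErrors.get(dataset) == 0
            && snapshotErrors.getOrDefault(dataset, 0) > 0)
        .count();
    return (double) newlyInvalid / releaseErrors.size() <= maxNewlyInvalidRatio;
  }

  public static void main(String[] args) {
    // Illustrative error counts per dataset for the two validator versions.
    Map<String, Integer> release = Map.of("feed-a", 0, "feed-b", 3, "feed-c", 0);
    Map<String, Integer> snapshot = Map.of("feed-a", 2, "feed-b", 3, "feed-c", 0);
    // 1 of 3 datasets became invalid, well over the 1% threshold, so the change is flagged.
    System.out.println(isAcceptable(release, snapshot, 0.01));
  }
}
```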

Needed:

  • Validator linked to the MobilityDatabase (running on all datasets present on the MobilityDatabase)
  • Add a step before committing/merging a PR to ensure there is no disruption in the datasets when a new rule is added.
