1.0 API Design #5

Open
teburd opened this issue Oct 12, 2017 · 9 comments

@teburd
Member

teburd commented Oct 12, 2017

It would be much nicer to simply work with the .zip file

Ex:

let gtfs = GTFS::from_zip("gtfs.zip").unwrap();
// Iterate over every agency in the feed
for agency in gtfs.agencies() {
    println!("{:?}", agency);
}
// Iterate over every stop in the feed
for stop in gtfs.stops() {
    println!("{:?}", stop);
}

Solved with #14

@teburd teburd added this to the 1.0 milestone Nov 9, 2017
@teburd teburd changed the title Simplified API 1.0 API Design Nov 9, 2017
@teburd
Member Author

teburd commented Nov 9, 2017

For a 1.0 release, let's talk about what we want this crate to look like and how we want it to work.

I think we're on the right track so far. Generally speaking I'd like to differentiate between GTFS the format, and a TransitNetwork the API.

Things I'd want to be able to do with a GTFS type

  • Read GTFS and decode into useful structures
  • Write GTFS and encode useful structures
  • Validate based on the rules given by the spec

Versus things I'd want to be able to do with a TransitNetwork type

  • Get general information about the network (agency, shape, etc)
  • Get a list of routes and their types
  • Get a list of stops for a route
  • Get a list of stops near a location
  • Get related information for a stop (routes, trips, stop times, geolocation/shape)

I'd like to think that TransitNetwork is a trait, not an implementation, and that there would be an implementation for GTFS, but that there might also be implementations for the wide variety of live transit APIs out there.

I think a lot of what you're thinking is similar to what I was thinking for what I'm tentatively calling the TransitNetwork trait. It can aggregate shapes/stop times, have various indices and such to help provide the convenience I think most people (myself included) really want to get out of the data that is stored in GTFS and provided by various transit APIs.
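
To make the GTFS/TransitNetwork split a bit more concrete, here is a rough sketch of what such a trait could look like. Everything in it is an assumption for discussion: the record types, the field names, the method names and the `GtfsNetwork` struct are hypothetical and not part of this crate.

```rust
use std::collections::HashMap;

/// Hypothetical record types; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Route {
    pub id: String,
    /// GTFS route_type, e.g. 3 for bus.
    pub route_type: u16,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Stop {
    pub id: String,
    pub lat: f64,
    pub lon: f64,
}

/// Hypothetical trait covering a few of the queries listed above.
pub trait TransitNetwork {
    fn routes(&self) -> Vec<&Route>;
    fn stops_for_route(&self, route_id: &str) -> Vec<&Stop>;
    fn stops_near(&self, lat: f64, lon: f64, radius_m: f64) -> Vec<&Stop>;
}

/// One possible implementation, backed by data already decoded from GTFS.
pub struct GtfsNetwork {
    pub routes: Vec<Route>,
    pub stops: Vec<Stop>,
    /// Index from route id to the ids of the stops it serves.
    pub route_stops: HashMap<String, Vec<String>>,
}

/// Great-circle distance in metres (haversine formula).
fn haversine_m(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    const EARTH_RADIUS_M: f64 = 6_371_000.0;
    let (p1, p2) = (lat1.to_radians(), lat2.to_radians());
    let dp = (lat2 - lat1).to_radians();
    let dl = (lon2 - lon1).to_radians();
    let a = (dp / 2.0).sin().powi(2) + p1.cos() * p2.cos() * (dl / 2.0).sin().powi(2);
    2.0 * EARTH_RADIUS_M * a.sqrt().asin()
}

impl TransitNetwork for GtfsNetwork {
    fn routes(&self) -> Vec<&Route> {
        self.routes.iter().collect()
    }

    fn stops_for_route(&self, route_id: &str) -> Vec<&Stop> {
        let ids = match self.route_stops.get(route_id) {
            Some(ids) => ids,
            None => return Vec::new(),
        };
        // Linear lookup per id; a real implementation would index stops too.
        ids.iter()
            .filter_map(|id| self.stops.iter().find(|s| &s.id == id))
            .collect()
    }

    fn stops_near(&self, lat: f64, lon: f64, radius_m: f64) -> Vec<&Stop> {
        self.stops
            .iter()
            .filter(|s| haversine_m(lat, lon, s.lat, s.lon) <= radius_m)
            .collect()
    }
}
```

A second implementation backed by a live transit API would only need to provide the same three methods, which is the main argument for a trait here.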

I look forward to your thoughts. If we get a nice plan together we can break things down and work on various parts individually and together as we need to.

@medwards
Collaborator

I don't need writing or validation atm but FWIW I think you're right that it should be included in a 1.0 release. GTFS reading is also still unfinished: we're missing Extended GTFS support. Whether that needs to be in place for a 1.0 release or not is a separate discussion, for now I'm just reminding you.

As far as TransitNetwork stuff: I think you've defined a good set of high-level features that are of interest to feed consumers. I'd be careful about trying to get abstract and making it a trait at this juncture; in particular, GTFS makes a lot of decisions about its domain model that aren't reflected in other specs. That's a decision we can always revisit anyway.

I have some ideas for an extended feature set for TransitNetwork too but the scope is already pretty big so imma hold back.

medwards added a commit to medwards/transitfeed that referenced this issue Nov 16, 2017
FeedReader helps access transit feed entries in a compressed archive or
directory.

TransitFeed is a helpful container for transit feed entries that can be
filled with a feed reader.

Addresses some of georust#5.
medwards added a commit to medwards/transitfeed that referenced this issue Apr 18, 2018
FeedReader helps access transit feed entries in a compressed archive or
directory.

TransitFeed is a helpful container for transit feed entries that can be
filled with a feed reader.

Addresses some of georust#5.
medwards added a commit to medwards/transitfeed that referenced this issue May 2, 2018
FeedReader helps access transit feed entries in a compressed archive or
directory.

TransitFeed is a helpful container for transit feed entries that can be
filled with a feed reader.

Addresses some of georust#5.

thanks @DenZip for the FeedProvider Trait idea
medwards added a commit to medwards/transitfeed that referenced this issue Jun 10, 2018
FeedReader helps access transit feed entries in a compressed archive or
directory.

TransitFeed is a helpful container for transit feed entries that can be
filled with a feed reader.

Addresses some of georust#5.

thanks @DenZip for the FeedProvider Trait idea
medwards added a commit to medwards/transitfeed that referenced this issue Jul 9, 2019
FeedReader helps access transit feed entries in a compressed archive or
directory.

TransitFeed is a helpful container for transit feed entries that can be
filled with a feed reader.

Addresses some of georust#5.

thanks @DenZip for the FeedProvider Trait idea
@teburd
Member Author

teburd commented Jul 18, 2019

I think ignoring writing out a transit feed for now is fine, and validation is partially done by simply reading the files the way we are, since some formatting checks already happen. In my opinion, what we have now, after a lot of great work from @medwards, is 1.0, unless anyone feels otherwise.

@medwards
Collaborator

I briefly looked over the GTFS Extensions ( https://developers.google.com/transit/gtfs/reference/gtfs-extensions ) and I'm worried about the optional columns that it introduces. Those can't be supported without breaking backwards compatibility right now.

I'd also want to introduce a ShapePoint and StopTime helper before announcing it anywhere but that's not a 1.0 blocker.

@teburd
Member Author

teburd commented Jul 22, 2019

@medwards sounds like a plan to me

@derhuerst

derhuerst commented Jun 24, 2020

Hey! 👋

I'm currently working on public-transport/gtfs-utils#25, an overhaul to the gtfs-utils JavaScript library. I thought about porting it to Rust for better performance and then found this repo.

My 2 cents on API design from the gtfs-utils/JavaScript perspective:

reusability

It can aggregate shapes/stop times, have various indices and such to help provide the convenience I think most people (myself included) really want to get out of the data that is stored in GTFS and provided by various transit APIs

Answering quite basic (but, in practical usage of GTFS, very relevant) questions like "When does any vehicle depart at a bus stop?" is a surprising amount of work: GTFS Time values are inherently timezone-dependent, frequencies.txt with exact_times=1 defines "stop times" as well, etc. With the ever-growing number of optional parts and extensions, doing GTFS processing right is a lot of work, so we should make the implementation in this project as reusable/flexible as possible.
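
As a small illustration of why GTFS Time values need care: they are offsets from "noon minus 12h" of the service day, so the hour field may exceed 23 (for trips running past midnight). A minimal sketch of a parser follows; the function name `parse_gtfs_time` is made up for this example. Turning the result into an absolute instant additionally needs the service date and the agency timezone, which is not shown here.

```rust
/// Parses a GTFS Time value ("H:MM:SS" or "HH:MM:SS") into seconds since
/// "noon minus 12h" of the service day. Hours may exceed 23, e.g. "25:10:05".
/// Returns None for malformed input.
pub fn parse_gtfs_time(s: &str) -> Option<u32> {
    let mut parts = s.trim().split(':');
    let h: u32 = parts.next()?.parse().ok()?;
    let m: u32 = parts.next()?.parse().ok()?;
    let sec: u32 = parts.next()?.parse().ok()?;
    // Reject extra fields and out-of-range minutes/seconds.
    if parts.next().is_some() || m > 59 || sec > 59 {
        return None;
    }
    // checked_* guards against absurdly large hour values overflowing.
    h.checked_mul(3600)?.checked_add(m * 60 + sec)
}
```

Keeping the value as an offset rather than a wall-clock time avoids silently wrapping "25:10:05" back to 01:10:05 on the wrong day.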

Also, a project- and language-independent test suite, i.e. a set of fixtures per "question"/operation, would be very helpful for this. Those have been very successful in other areas, e.g. for WebSocket implementations.

storage-independence

Personally, I really want GTFS to move away from .zip archives. They are inherently unfriendly to many things that GTFS would benefit greatly from: ever-updating "live" feeds, caching, content-addressed storage, sparse replication/access. There are far better tools for transferring/packaging/versioning a set of files!

With gtfs-utils, I try to push towards storage-independent GTFS processing (as in "read trips.txt from somewhere, I don't care"). Public Rust Traits seem to be a great tool for this.
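
A rough sketch of what "read trips.txt from somewhere" could look like as a trait. The names `FeedSource`, `MemorySource`, `DirSource` and `count_data_rows` are hypothetical and only serve this example; they are not this crate's API.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor, Read};
use std::path::PathBuf;

/// Hypothetical trait: anything that can hand out a GTFS file by name.
pub trait FeedSource {
    fn open(&self, file_name: &str) -> io::Result<Box<dyn Read + '_>>;
}

/// In-memory source, handy for tests and fixtures.
pub struct MemorySource {
    pub files: HashMap<String, Vec<u8>>,
}

impl FeedSource for MemorySource {
    fn open(&self, file_name: &str) -> io::Result<Box<dyn Read + '_>> {
        match self.files.get(file_name) {
            Some(bytes) => Ok(Box::new(Cursor::new(bytes.as_slice()))),
            None => Err(io::Error::new(
                io::ErrorKind::NotFound,
                format!("{} not in feed", file_name),
            )),
        }
    }
}

/// Source backed by an unpacked feed directory.
pub struct DirSource {
    pub root: PathBuf,
}

impl FeedSource for DirSource {
    fn open(&self, file_name: &str) -> io::Result<Box<dyn Read + '_>> {
        Ok(Box::new(File::open(self.root.join(file_name))?))
    }
}

/// Example consumer that does not care where the bytes come from.
/// Simplification: counts non-empty lines after the header, so quoted
/// fields containing newlines would be miscounted.
pub fn count_data_rows(source: &dyn FeedSource, file_name: &str) -> io::Result<usize> {
    let reader = BufReader::new(source.open(file_name)?);
    let mut rows = 0;
    for line in reader.lines().skip(1) {
        if !line?.trim().is_empty() {
            rows += 1;
        }
    }
    Ok(rows)
}
```

A .zip archive, an HTTP endpoint or a content-addressed store would each be one more implementation of the same trait.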

scalability

GTFS feeds will become significantly larger than the hundreds-of-MB feeds that are common now. The Germany-wide feed is 2.5GB already, so, for example, a European feed including a lot of shapes will probably be dozens of GB in size.

With gtfs-utils, I therefore try to read as little data into memory as needed for a certain operation, and add a storage API layer for storing intermediate data in places other than memory. In gtfs-utils, this is an async key-value store API that uses memory by default. Again, a publicly exposed Trait seems very fitting. This of course still leaves open the possibility of reading all data into memory for low latency and high performance.
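
A minimal sketch of such a storage layer in Rust, assuming a synchronous API for brevity (the gtfs-utils one is async). `KeyValueStore`, `MemoryStore`, `count_trips_per_route` and `read_count` are hypothetical names for this example only.

```rust
use std::collections::HashMap;

/// Hypothetical storage layer for intermediate results.
pub trait KeyValueStore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, value: Vec<u8>);
}

/// Default implementation that keeps everything in memory.
#[derive(Default)]
pub struct MemoryStore {
    pub entries: HashMap<String, Vec<u8>>,
}

impl KeyValueStore for MemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
    }
}

/// Decodes a little-endian u64 counter; anything else counts as 0.
fn decode(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    if bytes.len() == 8 {
        buf.copy_from_slice(bytes);
    }
    u64::from_le_bytes(buf)
}

/// Reads the counter stored under `key`, or 0 if there is none.
pub fn read_count<S: KeyValueStore>(store: &S, key: &str) -> u64 {
    store.get(key).map(|b| decode(&b)).unwrap_or(0)
}

/// Example consumer: counts trips per route while streaming route ids,
/// keeping the counters in the store instead of in a local map.
pub fn count_trips_per_route<S, I>(store: &mut S, route_ids: I)
where
    S: KeyValueStore,
    I: IntoIterator<Item = String>,
{
    for route_id in route_ids {
        let next = read_count(store, &route_id) + 1;
        store.put(&route_id, next.to_le_bytes().to_vec());
    }
}
```

Swapping `MemoryStore` for a disk-backed implementation would then not require touching the processing code.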

If the input files are sorted in a specific way, we can increase processing speed as follows:

  • If two files A and B, with rows of B referencing rows of A, are sorted in a compatible way, we can resolve and process these references iteratively. Example: When computing the days of operation for a service from calendar.txt & calendar_dates.txt, and we read a service X from calendar.txt, we only need to iterate calendar_dates.txt up to the "end" of the rows referencing X. All calendar_dates.txt rows before X (that we didn't process as part of a previous calendar.txt iteration) can be discarded because there's no such service defined. All rows after X can be dealt with in the next iteration. This allows us to skip a sorting/hash-map-building step.
  • In many cases, processing can be parallelized, e.g. computing the days of operation of two services, because the data chunks are independent. Sorted files make this a lot easier to implement.
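
The first point is essentially a merge join over two sorted inputs. Here is a sketch of it in Rust, assuming both inputs are sorted ascending by service_id; the `Service` and `Exception` structs are cut down to what the example needs and `join_sorted` is a made-up name.

```rust
/// Simplified calendar.txt row; the real file carries more columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Service {
    pub service_id: String,
}

/// Simplified calendar_dates.txt row.
#[derive(Debug, Clone, PartialEq)]
pub struct Exception {
    pub service_id: String,
    pub date: String,
    /// 1 = service added on this date, 2 = service removed.
    pub exception_type: u8,
}

/// Calls `on_service` once per service with the exceptions referencing it,
/// reading both inputs exactly once and never building a hash map.
/// Assumption: both inputs are sorted ascending by `service_id`.
pub fn join_sorted<I, J, F>(services: I, exceptions: J, mut on_service: F)
where
    I: IntoIterator<Item = Service>,
    J: IntoIterator<Item = Exception>,
    F: FnMut(Service, Vec<Exception>),
{
    let mut exceptions = exceptions.into_iter().peekable();
    for service in services {
        // Rows sorting before this service reference no service from the
        // first input; following the description above, drop them.
        while exceptions
            .next_if(|e| e.service_id < service.service_id)
            .is_some()
        {}
        // Collect the run of rows that reference this service.
        let mut matched = Vec::new();
        while let Some(e) = exceptions.next_if(|e| e.service_id == service.service_id) {
            matched.push(e);
        }
        // Rows sorting after this service stay queued for the next iteration.
        on_service(service, matched);
    }
}
```

Only the exceptions of the service currently being processed are held in memory at any time.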

validation

There are a bazillion validation (i.e. "semantic checks on the actual data") cases. The best practices page is long, and the GTFS issue tracker and mailing lists are full of edge cases. There are at least 20 libs across languages doing some form of validation, but none of them cover all the issues that we see with GTFS feeds in the wild.

I'd dare to say that people don't care which language a GTFS validator is written in, but they strongly prefer a certain language for "questions"/analysis. Like the "questions"/analysis mentioned above, validation lends itself to a project- and language-independent set of fixtures, maintained by the wider GTFS community. I hope this will push the overall quality of GTFS feeds, and reduce the amount of duplicated work poured into all those GTFS validation libs. I therefore propose not to put too much effort into validation in this project (I'm obviously just a random stranger telling you what to do 😬).

Edit: I have created public-transport/ideas#17 for the out-of-scope task of creating such a cross-project GTFS test suite.

@antoine-de

Just to give some pointers (and it might give you some ideas), there are already several Rust libraries for GTFS handling:

@derhuerst

Just giving an update on #5 (comment) here.

I have implemented public-transport/gtfs-utils#25. gtfs-utils now relies on a specific order in the individual GTFS files, in order to only read those rows into memory that are relevant for a specific merge operation, e.g. when merging stop_times, trips, & calendar/calendar_dates. With JavaScript being inherently unsuited and slow for this type of sequential data processing though, I came back to find out how to do common higher-level GTFS operations (like "Which vehicles stop at stop A on Nov 3rd at 7pm?") in Rust.

I'm a Rust junior, so forgive me if I ask such naive questions, but is it true that gtfs-structure is essentially the same thing as this project, except that it can optionally read data into a HashMap? If that is the case, let's discuss merging the two projects!

@teburd
Member Author

teburd commented Oct 30, 2020

@derhuerst

This crate provides a lazy iterator over CSV rather than attempting to parse and load the entire GTFS file set into memory all at once.
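
To illustrate the lazy approach in general terms, here is a standard-library-only sketch and not this crate's actual code: `StopIter` and the reduced `Stop` struct are hypothetical, and the naive comma split ignores quoted fields, which a real CSV parser (such as the csv crate) would handle.

```rust
use std::io::{self, BufRead};

/// Hypothetical, cut-down stops.txt record.
#[derive(Debug, PartialEq)]
pub struct Stop {
    pub stop_id: String,
    pub stop_name: String,
}

/// Yields one `Stop` per line, decoding rows only as the caller asks for them.
pub struct StopIter<R: BufRead> {
    lines: io::Lines<R>,
    id_col: usize,
    name_col: usize,
}

impl<R: BufRead> StopIter<R> {
    /// Reads only the header line to locate the columns of interest.
    pub fn new(reader: R) -> io::Result<Self> {
        let mut lines = reader.lines();
        let header = lines.next().transpose()?.unwrap_or_default();
        let col = |name: &str| {
            header
                .split(',')
                .position(|c| c.trim() == name)
                .ok_or_else(|| {
                    io::Error::new(
                        io::ErrorKind::InvalidData,
                        format!("missing column {}", name),
                    )
                })
        };
        let id_col = col("stop_id")?;
        let name_col = col("stop_name")?;
        Ok(StopIter { lines, id_col, name_col })
    }
}

impl<R: BufRead> Iterator for StopIter<R> {
    type Item = io::Result<Stop>;

    fn next(&mut self) -> Option<Self::Item> {
        let line = match self.lines.next()? {
            Ok(line) => line,
            Err(e) => return Some(Err(e)),
        };
        // Simplification: no support for quoted fields containing commas.
        let fields: Vec<&str> = line.split(',').collect();
        let get = |i: usize| fields.get(i).map(|f| f.trim().to_string());
        Some(match (get(self.id_col), get(self.name_col)) {
            (Some(stop_id), Some(stop_name)) => Ok(Stop { stop_id, stop_name }),
            _ => Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "row has too few fields",
            )),
        })
    }
}
```

Memory use stays at roughly one row regardless of file size, and a caller who does want everything in memory can still `collect()` the iterator.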

Something I found particularly painful when writing tflgtfs (transit for london to gtfs)
