Parsing instruments as functions [bug?] #12
Comments
There isn't a bug in the parsing here, per se. The instrument labels end up among the function labels because they are missing a bracket (either opening or closing), and the bracket is what distinguishes instrument tags from functions. Instrument tags are supposed to be open, close, or self-contained, like "(trumpet", "bass)" or "(vocal)". The function vocabulary is intended to be limited to a set of, I think, around 20 terms. But at certain points in the development of the data, we allowed more terms (and didn't go back and revise all the existing data as the process evolved), and annotators occasionally used their best judgement to add more (like "piece_1" and "piece_2" for when a particular track really seemed to consist of two independent songs).
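To make the bracket convention concrete, here is a minimal sketch of how a token could be classified by its parentheses alone. The function name and the category strings are made up for illustration; this is not the existing parser.

```python
def classify_label(token):
    """Classify one label token using only its parentheses (hypothetical helper)."""
    token = token.strip()
    opens = token.startswith("(")
    closes = token.endswith(")")
    if opens and closes:
        return "instrument_self_contained"
    if opens:
        return "instrument_open"
    if closes:
        return "instrument_close"
    # No bracket at all: a function or letter label, or an instrument
    # label whose bracket was left off by the annotator.
    return "other"


for tok in ["(trumpet", "bass)", "(vocal)", "vocal"]:
    print(tok, "->", classify_label(tok))
```

A bare "vocal" lands in the same bucket as a real function word, which is exactly how instrument names leak into the function annotations.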
Anyway, as it happens, I already rewrote the parser in Python; I should polish it up and upload it.
I see -- so, had the functions actually kept to a closed vocab, it would be possible to tease out instrumentation without needing parentheses. But, as the data currently exists, that's not possible? It does seem like there are some legitimate bugs though: e.g., "d/b" or "b/c'", which really are lower-case segments that the annotator couldn't decide on. These seem rare though, and maybe could be fixed with a little bit of additional manual inspection? At any rate, it does seem like the behavior here is not as intended: things that are not "functions" end up in the "functions" annotations, primarily because of missing parens. Would it be possible to go through and add parens to anything that's obviously non-functional? This could be done programmatically without too much work, since we have a finite sample and a relatively small vocab.
I think that truly fixing the annotations will require manual inspection. However, you could apply conservative patches in the meantime. For functions: ignore function words that fall outside the agreed-upon vocabulary (see page 9 of the Annotator's Guide). For instruments: close up any tags that are left open or were never opened. For example, an unclosed "(vocal" becomes "(vocal)", and an unopened "vocal)" becomes "(vocal)". But the conservative patch you apply would depend on your usage. If you're training a neural net with positive and negative examples of clips with certain leading instrumentation, your strategy may change.
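The two conservative patches could be sketched roughly as follows. The vocabulary set below is a placeholder I made up for the example (the real list is the one on page 9 of the Annotator's Guide), and both function names are hypothetical.

```python
# Placeholder vocabulary for illustration only; the agreed-upon list is the
# one in the Annotator's Guide and is not reproduced here.
FUNCTION_VOCAB = {"intro", "verse", "chorus", "transition", "outro"}


def close_instrument_tag(token):
    """Turn '(vocal' or 'vocal)' into '(vocal)'; leave other tokens unchanged."""
    token = token.strip()
    if token.startswith("(") and not token.endswith(")"):
        return token + ")"
    if token.endswith(")") and not token.startswith("("):
        return "(" + token
    return token


def filter_functions(tokens, vocab=FUNCTION_VOCAB):
    """Keep only function words found in the vocabulary (exact match, no case folding)."""
    return [t.strip() for t in tokens if t.strip() in vocab]


print(close_instrument_tag("(vocal"))
print(close_instrument_tag("vocal)"))
print(filter_functions(["verse", "piece_1", "chorus"]))
```

Note that closing every tag this way throws away the information about where an instrument span starts and ends, so it only suits uses that need per-segment tags.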
To elaborate: I would like to go in and manually fix some of the instrument issues, and any function / letter label issues that clearly derive from typos. But some issues I would rather leave unfixed, like an annotator using a special function word, or throwing up their hands and saying "b/c" in lieu of "b" or "c". In these cases, I would rather just make available a standardized (but not human-authored) version. For functions, this would mean choosing a mapping of all non-standard function words to standard words, the same way the Annotator's Guide anticipates that you could simplify the set {"pre-verse", "pre-chorus", "interlude", "transition"} --> "transition". And for non-standard letters, a standard could be: "assign ambiguous cases a new letter."
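As a rough sketch of what such a standardized version might do: the mapping below contains only the one collapse mentioned above, and the rule for ambiguous letters (one unused lower-case letter per distinct ambiguous label) is my own assumption about how "assign a new letter" could work. All names are hypothetical.

```python
import string

# Illustrative mapping; only the collapse named in the thread is included.
FUNCTION_MAP = {
    "pre-verse": "transition",
    "pre-chorus": "transition",
    "interlude": "transition",
    "transition": "transition",
}


def standardize_function(word, mapping=FUNCTION_MAP):
    """Map a non-standard function word to its standard word, if one is listed."""
    key = word.strip()
    return mapping.get(key, key)


def relabel_ambiguous(letters):
    """Replace each distinct ambiguous label such as 'b/c' with an unused letter."""
    # Base letters already in use, ignoring primes (so "c'" counts as "c").
    used = {lab[0] for lab in letters if lab and "/" not in lab}
    fresh = (c for c in string.ascii_lowercase if c not in used)
    assigned = {}
    out = []
    for lab in letters:
        if "/" in lab:
            if lab not in assigned:
                assigned[lab] = next(fresh)
            out.append(assigned[lab])
        else:
            out.append(lab)
    return out


print(standardize_function("pre-chorus"))
print(relabel_ambiguous(["a", "b", "b/c", "c", "b/c"]))
```

Whether two occurrences of "b/c" should share one new letter or each get their own is a judgement call; the sketch shares one.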
Okay. I've done this kind of thing before, but I would much prefer that we have a standardized "clean" version that everyone can use.
Isn't that just equivalent to discarding the annotation? The labels should apply to labels, not boundaries, so if you have a boundary marked as
More generally: how do you feel about migrating the whole thing to an interval-based representation instead of boundaries? It would solve a lot of headaches.
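For what it's worth, a boundary-to-interval conversion could look something like this. It assumes each label describes the segment that starts at its boundary, and that the final boundary only marks the end of the track; the function name and the sample times are made up.

```python
def boundaries_to_intervals(boundaries):
    """Convert sorted (time, label) boundaries into (start, end, label) intervals.

    Assumes each label applies from its own time up to the next boundary,
    and that the last boundary only closes the final segment.
    """
    intervals = []
    for (start, label), (end, _) in zip(boundaries, boundaries[1:]):
        intervals.append((start, end, label))
    return intervals


marks = [(0.0, "silence"), (0.5, "intro"), (12.3, "verse"), (40.0, "end")]
for interval in boundaries_to_intervals(marks):
    print(interval)
```

With intervals, every label is attached to a span of time, so an ambiguous or non-standard label can be remapped without touching the segmentation itself.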
Agreed. Any ideas on how to do that? They seem like such a limited set of cases that it may as well be done manually, though maybe things are different on the full (private) dataset.
Sure. It seems like your vocabulary file already does the hard work there.
What's a "non-standard letter"? You mean like
Splitting off from #10: the instrument annotations are distinct from "functions", and as far as I understand, should not be included in _functions annotations. @jblsmith correct me if I'm wrong here?
If that's the case, there are several parsing errors in the pre-parsed CSV files (see below).
At one point, I had cleaned this stuff up in a notebook, but I think it would be better all around to just fix or rewrite the parser. (Since I don't speak Ruby, I'd just redo it in Python.) Any objections?