Not every boundary has a lowercase annotation #13

Open
bmcfee opened this issue Jun 21, 2016 · 7 comments

bmcfee commented Jun 21, 2016

As per label rules, point (a):

Letter labels: The entire piece must be fully labelled with both uppercase
and lowercase letter labels. That is, no time-span should be unlabelled at
any hierarchical level. The only exceptions are the special labels described below. 

a) Lowercase labels: Every boundary must be provided with a lowercase letter label.
b) Uppercase labels: While the entire piece must be fully labelled with uppercase letter labels, not each section needs to be labelled individually: an uppercase label is assumed to persist until the next uppercase label. 

These points seem to imply that every uppercase boundary should coincide with a lowercase boundary. That doesn't seem to hold in the data, though; see, e.g., 100/textfile2.txt. I ran a script to find this kind of anomaly, and tallied about 451 files with significant (>3s) disagreements between an uppercase boundary and its closest lowercase boundary.

How should these be interpreted?
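
The check described above can be sketched roughly as follows. This is not the script that produced the 451-file tally; it is a minimal illustration that assumes the boundary times have already been parsed into plain lists of seconds. The function name `misaligned_upper_boundaries` and the sample times are hypothetical.

```python
from bisect import bisect_left


def misaligned_upper_boundaries(upper_times, lower_times, tolerance=3.0):
    """Return (upper_time, gap) pairs for uppercase boundaries whose closest
    lowercase boundary is more than `tolerance` seconds away."""
    lower_sorted = sorted(lower_times)
    offenders = []
    for t in upper_times:
        i = bisect_left(lower_sorted, t)
        # Only the lowercase boundaries just before and just after t can be closest.
        neighbours = lower_sorted[max(i - 1, 0):i + 1]
        # With no lowercase boundaries at all, the gap is treated as infinite.
        gap = min((abs(t - s) for s in neighbours), default=float("inf"))
        if gap > tolerance:
            offenders.append((t, gap))
    return offenders


# Made-up boundary times (seconds), for illustration only.
upper = [0.0, 30.0, 62.5, 95.0]
lower = [0.0, 15.0, 30.2, 45.0, 95.1]
print(misaligned_upper_boundaries(upper, lower))  # → [(62.5, 17.5)]
```

The 3-second default mirrors the threshold mentioned above; a file would count as anomalous whenever the returned list is non-empty.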

@jblsmith
Contributor

Yep, that's a data error! There should be a lowercase label on every line, basically. I'm not sure what the other 451 instances look like, but in this case I would assume a unique lowercase label for that segment. Alternatively, the "G" might be a mistyped "g". One would have to listen to the track to decide.

If you have the list of 451 files (and can automatically point to the lines representing the discrepancies), I'd be happy to take a look at more!

bmcfee commented Jun 27, 2016

Here's a notebook to detect misaligned upper-lower segments within a tolerance threshold, along with the list of offending inputs and annotations. These were computed on the latest pull from this repo.

@jblsmith
Contributor

Thanks! Very nifty. Looking at this, I realize something I forgot in my own
parser: among uppercase labels, "Z" was actually a kind of reserved label
that meant "non-music":

  • most often, pre-music applause in a rough live recording—hence its
    frequency in the Live Music Archive, SALAMI IDs ~1000–1500;
  • or, sometimes, a post-song dialogue, as in a soundtrack track.

So it's to be treated a bit like a "silence" or "end" tag, in that it is
relevant to all parsed annotations (upper/lower/func/instr). At a glance,
that might resolve 90% of the files; the true errors seem concentrated in a
smaller bunch.
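
To illustrate the point about "Z", here is a small sketch of how a parser might set reserved uppercase labels aside before tallying upper/lower disagreements. It assumes annotations are already available as `(time, label)` pairs; the name `split_reserved` and the sample data are hypothetical, and only "Z" is treated as reserved because it is the only such label named in this thread.

```python
# Assumption: "Z" is the only reserved uppercase label considered here.
RESERVED_UPPER = frozenset({"Z"})


def split_reserved(upper_segments, reserved=RESERVED_UPPER):
    """Partition (time, label) pairs into (ordinary, reserved) lists,
    preserving the original order within each list."""
    ordinary, special = [], []
    for time, label in upper_segments:
        (special if label in reserved else ordinary).append((time, label))
    return ordinary, special


# Made-up annotation: applause before the music, dialogue after it.
upper = [(0.0, "Z"), (12.4, "A"), (48.0, "B"), (200.3, "Z")]
ordinary, non_music = split_reserved(upper)
print(ordinary)   # → [(12.4, 'A'), (48.0, 'B')]
print(non_music)  # → [(0.0, 'Z'), (200.3, 'Z')]
```

Only the `ordinary` boundaries would then be compared against the lowercase layer, which is one way the tally of apparent errors could shrink.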

On 28 June 2016 at 04:02, Brian McFee wrote:

> Here's a notebook (https://gist.github.com/a4bde985242bb910d9db41b8bf550b72)
> to detect misaligned upper-lower segments within a tolerance threshold,
> along with the list of offending inputs and annotations. These were
> computed on the latest pull from this repo.



bmcfee commented Jun 28, 2016

OK, so does it make sense to insert a lowercase z boundary for every uppercase Z boundary? Otherwise, a significant amount of information is lost when only looking at the lowercase annotations.

Also, what do you want to do about the "true errors"?

bmcfee commented Jul 28, 2016

Pinging back on this: is it kosher to just propagate uppers->lowers for Z segments and treat the remaining missing boundaries as errors to be corrected?
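
A rough sketch of what propagating uppers->lowers for Z segments could look like, assuming each layer is a list of `(time, label)` pairs. The function name `propagate_z`, the lowercase label `"z"`, and the choice to skip Z boundaries that already have a lowercase boundary within the tolerance are all assumptions for illustration, not the repository's actual behaviour.

```python
def propagate_z(upper_segments, lower_segments, tolerance=3.0):
    """Copy each uppercase "Z" boundary into the lowercase layer as "z",
    unless a lowercase boundary already lies within `tolerance` seconds."""
    lower_times = [t for t, _ in lower_segments]
    merged = list(lower_segments)
    for time, label in upper_segments:
        if label != "Z":
            continue  # other missing boundaries are left alone, to be fixed as errors
        if all(abs(time - t) > tolerance for t in lower_times):
            merged.append((time, "z"))
    return sorted(merged)


# Made-up annotation layers, for illustration only.
upper = [(0.0, "Z"), (12.4, "A"), (200.3, "Z")]
lower = [(12.4, "a"), (30.0, "b"), (60.0, "a")]
print(propagate_z(upper, lower))
# → [(0.0, 'z'), (12.4, 'a'), (30.0, 'b'), (60.0, 'a'), (200.3, 'z')]
```

Any uppercase boundary that is still unmatched after this step would fall into the "errors to be corrected" group.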

@jblsmith
Contributor

Yes, I think that makes sense!

On 29 July 2016 at 03:05, Brian McFee wrote:

> Pinging back on this: is it kosher to just propagate uppers->lowers for Z
> segments and treat the remaining missing boundaries as errors to be
> corrected?



bmcfee commented Jul 29, 2016

> Yes, I think that makes sense!

Great. I can do this in a PR and send it back upstream to you, if you like. I'd like to avoid redundant work though -- it seems like there's already a list of errors in place in the new_parser branch. Will that be fixed/merged in the near future, or should I work independently on the upper-lower cleanup?

(Note: I'm working toward an 8/26 deadline for this, so the sooner I can get this set, the better. Not to impose my own constraints on you or anything 😁 )
