
Transcriptions/classifications that do not match the task/subject #370



Closed
denslowm opened this issue Jul 27, 2015 · 10 comments

Comments

@denslowm

There are a number of transcriptions/classifications that do not match the task/subject. The pattern we found is that this issue appears whenever the start and finish times are equal.
For example, there are 10 cases from Jul 1–7 in which the transcriptions do not match the image at all. It is immediately suspicious that the exact same transcription, from the same user, appears for a number of different tasks/subjects.

For example, on Jul 1 you can find 6 identical transcriptions from foxx86 for the subjects SELU0010136, SELU0006547, SELU0004005, SELU0010669, SELU0006802, and SELU0008005:
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795de3f6661fa5b2055f.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707958e3f6661fa5b1f77c.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707954e3f6661fa5b1ed95.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795fe3f6661fa5b20773.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707959e3f6661fa5b1f878.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795ae3f6661fa5b1fd29.jpg

This potentially affects 5.5% of the transcriptions, which is significant.

@chrissnyder
Contributor

Has this occurred since July 7? Does the transcription that was submitted look like a legit transcription or was it blank? Did this only happen to Herbarium records?

@robgur

robgur commented Jul 28, 2015

I think this has been a persistent problem by all accounts...


@chrissnyder
Contributor

Yes, but I'm trying to narrow down its root cause. My questions serve two purposes:

  • Was this caused by code added around the first of the month? If so, that's a much smaller area to search.
  • Was this seen in only one collection? If so, we can focus troubleshooting where that collection sends its data off to the API.

To note, I don't think it's an API issue, since other projects have not reported similar patterns of classifications from users. So it must be something within the NfN codebase itself that causes a user to submit identical transcriptions.

@denslowm
Author

I am looping in @ammatsun. She can help us answer.
We know it is the herbarium for sure at this point.

@ammatsun

This is a persistent problem. I selected July 1–7, 2015 just to show that it has occurred recently, but I have observed it in records going back to 2013 (so it is not due to a recent change). My best guess at this time, without knowing the code, is that there is some concurrency problem: state from different workers is getting mixed and/or producing this situation. In particular, it might happen when one worker skips a transcription task, but I could not locate a definitive pattern in the data.

I haven't looked at other collections closely, but glancing over the macrofungi collection, I found that transcription 5313467447bc7245280007be for subject 52545d9e5c2a110000000b7d (image http://www.notesfromnature.org/subjects/macrofungi/mich/52545d9e5c2a110000000b7d.jpg) mentions Canada even though the image has nothing about Canada in it, and the exact same transcription is also present for another subject, 525468915c2a11000000121e.

Differences of 1 or 2 seconds between start and finish times also point to cases where this issue appears.
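The timing signature described above lends itself to an automated check. As a rough sketch (assuming a CSV export with hypothetical column names `transcription_id`, `started_at`, and `finished_at` holding ISO-8601 timestamps; the real export schema may differ), one could flag every classification whose start and finish times are within a couple of seconds of each other:

```python
import csv
from datetime import datetime, timedelta

def flag_suspect_rows(path, max_delta=timedelta(seconds=2)):
    """Flag classification rows whose start and finish times are (nearly) equal.

    Assumes a CSV export with hypothetical columns 'transcription_id',
    'started_at', and 'finished_at' in ISO-8601 format.
    """
    suspects = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            start = datetime.fromisoformat(row["started_at"])
            finish = datetime.fromisoformat(row["finished_at"])
            # Zero-length (or near-zero) sessions match the reported pattern.
            if timedelta(0) <= finish - start <= max_delta:
                suspects.append(row["transcription_id"])
    return suspects
```

Flagged IDs could then be spot-checked by eye before being excluded from aggregation.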

@JoyceGross

I can confirm that this is happening with CalBug records too.

An example is subject 519e5c7eea30523400000457 (EMEC593148 Undetermined sp.jpg). There are 3 transcriptions with the correct locality information (Nevada), and a 4th one with completely different locality information (Minnesota).

The 4th record (transcription 54bbfdb9832cec520b0000c3) has the exact same data as transcription 54bbfdb929a6f6290f0000cb, which is for a different specimen. Both records were recorded at almost exactly the same time.

The date is January 2015 for the two above-mentioned transcriptions.

I remember seeing this problem a year or more ago with the CalBug data.

@denslowm
Author

Hey @chrissnyder,
I just wanted to check to see if you have any updates on this issue. What do we need to do to move this forward?

@denslowm
Author

I just wanted to report that I am seeing this issue in the BRIT (herbarium) dataset that came last night.

In addition to what has already been reported, I will note one other thing.

As noted earlier, the start and end times of the problematic records are the same (within each record and across all erroneous records), EXCEPT that one record has an earlier start time. Usually this is the last record in the set, but not always. The record with the earlier start time appears to be for the image that was actually transcribed and then applied to all the other records in the set.

@trouille

@denslowm Can you calculate how often you see this happening in the BRIT dataset? Since it appears as such a specific error, is it something you can code into your aggregation script to remove from the data? Have you already shared with the other research teams your approach for removing it, so they do the same with their data?

The reason to find out the rate is that this issue and #384 (which may be related) are proving very difficult to reproduce on demand for our devs to troubleshoot. I am wondering whether, if the rate is less than a few percent, we could do the following:

  1. message very clearly in Talk that we're aware of the problem, the rate at which it is happening, and how you're removing it from the results, so everyone knows it's not contaminating the research
  2. have people keep posting in Talk when they see it happen so we can tell whether there's suddenly an uptick in errors of this type
  3. focus dev effort on the new platform rather than spending significantly more time trying to find the solution to this problem that may have a limited enough impact that we're willing to remove it in post-processing
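The post-processing removal suggested in point 3 could be sketched roughly as follows (assuming records are available as dicts with hypothetical keys `user`, `text`, `subject_id`, and `started_at`; field names in a real export will differ). Within each group of identical (user, text) pairs, only the record with the earliest start time is kept, since per the pattern reported earlier that record appears to be the genuinely transcribed one:

```python
from collections import defaultdict

def drop_duplicated_transcriptions(records):
    """Remove duplicated transcriptions from a batch of classification records.

    `records` is a list of dicts with hypothetical keys 'user', 'text',
    'subject_id', and 'started_at' (comparable timestamps). Within a group
    of identical (user, text) pairs, only the record with the earliest start
    time is kept.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["user"], rec["text"])].append(rec)
    kept = []
    for group in groups.values():
        if len(group) == 1:
            kept.extend(group)
        else:
            # Keep the earliest-started record; the rest are presumed copies.
            kept.append(min(group, key=lambda r: r["started_at"]))
    return kept
```

This is only a first approximation: two users could legitimately submit identical short transcriptions, so combining this grouping with the equal start/finish timing check would reduce false positives.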

@parrish
Contributor

parrish commented Jun 27, 2016

Closing since the app has been relaunched. This was a weirdly intermittent bug in the front end that I never did manage to reproduce.
