Transcriptions/classifications that do not match the task/subject #370
Has this occurred since July 7? Does the transcription that was submitted look like a legit transcription or was it blank? Did this only happen to Herbarium records? |
I think this has been a persistent problem by all accounts... |
Yes, but I'm trying to narrow down what its root cause is. My questions serve two purposes:
To note, I don't think it's an API issue, as other projects have not reported similar patterns of classifications from users. So it has to be something within the NfN codebase itself that might cause a user to submit identical transcriptions. |
I am looping in @ammatsun. She can help us answer. |
This is a persistent problem. I selected July 1-7, 2015 only to show that it has occurred recently; I have observed it in records going back to 2013, so it is not due to a recent change. My best guess at this time, without knowing the code, is that there is some concurrency problem and state from different workers is getting mixed, producing this situation. In particular, this might be happening when one worker skips a transcription task, but I could not locate a definitive pattern in the data. I haven't looked at other collections closely, but I glanced over the macrofungi collection and found that transcription 5313467447bc7245280007be for subject 52545d9e5c2a110000000b7d (image http://www.notesfromnature.org/subjects/macrofungi/mich/52545d9e5c2a110000000b7d.jpg) mentions Canada although nothing in the image does, and the exact same transcription is also present for another subject, 525468915c2a11000000121e. The issue also appears in cases where the start and finish times differ by 1 or 2 seconds. |
I can confirm that this is happening with CalBug records too. An example is subject 519e5c7eea30523400000457 (EMEC593148 Undetermined sp.jpg). There are 3 transcriptions with the correct locality information (Nevada), and a 4th one with completely different locality information (Minnesota). The 4th record (transcription 54bbfdb9832cec520b0000c3) has the exact same data as transcription 54bbfdb929a6f6290f0000cb, which is for a different specimen. Both records were recorded at almost exactly the same time. The date is January 2015 for the two above-mentioned transcriptions. I remember seeing this problem a year or more ago with the CalBug data. |
Hey @chrissnyder, |
I just wanted to report that I am seeing this issue in the BRIT (herbarium) dataset that came last night. In addition to what has already been reported, I will note one other thing. As noted earlier, start and end times of the problematic records are the same (within each record and across all erroneous records) EXCEPT one record will have an earlier start time. Usually this is the last record in the set, but not always. The record which has this earlier start time seems to be for the image that was transcribed and applied to all other records in the set. |
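The pattern described in this thread (one user, identical transcription content, several different subjects, with one record starting earlier than the rest) could be flagged in an aggregation script along these lines. This is only a sketch: the field names `id`, `user`, `subject_id`, `started_at` and `content` are assumptions about the export format, not the actual NfN schema, and `content` is assumed to be a flat dict of strings.

```python
from collections import defaultdict


def flag_propagated(records):
    """Split duplicated transcriptions into (suspect_ids, likely_original_ids).

    Field names are hypothetical; adjust them to the real export.
    """
    # Group records by user and by the exact transcription content.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["user"], tuple(sorted(rec["content"].items())))
        groups[key].append(rec)

    suspects, originals = set(), set()
    for recs in groups.values():
        # Only sets that span more than one subject match the reported pattern.
        if len({r["subject_id"] for r in recs}) < 2:
            continue
        earliest = min(r["started_at"] for r in recs)
        first = [r for r in recs if r["started_at"] == earliest]
        # A single record with the earliest start is treated as the likely source.
        original_id = first[0]["id"] if len(first) == 1 else None
        if original_id is not None:
            originals.add(original_id)
        for r in recs:
            if r["id"] != original_id:
                suspects.add(r["id"])
    return suspects, originals
```

If no single record has the earliest start time, every record in the set is flagged as suspect. Note that blank transcriptions submitted for several subjects would also be grouped together, so they may need to be handled separately.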
@denslowm Can you calculate how often you see this happening in the BRIT dataset? Since it appears as such a specific error, is it something you can code into your aggregation script to remove from the data? Have you already shared with the other research teams your approach for removing it, so they do the same with their data? The reason to find out the rate is that this issue and #384 (which may be related) are proving very difficult to reproduce on command for our devs to troubleshoot. I am wondering whether, if the rate is less than a few %, we could do the following:
|
Closing since the app has been relaunched. This was a weirdly intermittent bug in the front end that I never did manage to reproduce. |
A number of transcriptions/classifications do not match the task/subject. The pattern I found is that the issue appears whenever the start and finish times are equal.
For example, there are 10 cases from Jul 1-7 in which the transcription does not match the image at all. What stands out is that the exact same transcription from the same user appears for a number of different tasks/subjects.
For example, on Jul 1 you can find 6 identical transcriptions from foxx86 for the subjects SELU0010136, SELU0006547, SELU0004005, SELU0010669, SELU0006802, SELU0008005.
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795de3f6661fa5b2055f.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707958e3f6661fa5b1f77c.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707954e3f6661fa5b1ed95.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795fe3f6661fa5b20773.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/55707959e3f6661fa5b1f878.jpg
https://static.zooniverse.org/www.notesfromnature.org/subjects/sernec/selu_images/5570795ae3f6661fa5b1fd29.jpg
This is potentially affecting 5.5% of the transcriptions, which is really significant.
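A rate like this could be estimated from an export with a sketch such as the following. It assumes each record carries ISO 8601 `started_at` and `finished_at` strings (hypothetical field names). The `max_seconds` tolerance is there because differences of 1 or 2 seconds have also been observed on affected records.

```python
from datetime import datetime


def near_zero_duration(records, max_seconds=0):
    """Return records whose finish time is within max_seconds of the start."""
    flagged = []
    for rec in records:
        start = datetime.fromisoformat(rec["started_at"])
        finish = datetime.fromisoformat(rec["finished_at"])
        if abs((finish - start).total_seconds()) <= max_seconds:
            flagged.append(rec)
    return flagged


def flagged_rate(records, max_seconds=0):
    """Fraction of records flagged by near_zero_duration (0.0 if empty)."""
    if not records:
        return 0.0
    return len(near_zero_duration(records, max_seconds)) / len(records)
```

A near-zero duration is only an indicator, so the flagged records should still be checked against the images before being discarded.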