I believe there’s an issue with your process at work here based on this:
It is expected for us to find a duplicate copy of ID’d photos in the system because the ID kits each contain a renamed copy of the raw tourist pic. The raw tourist pic remained as is in the relevant sighting folder, which was uploaded, as were the ID kits.
The system does not support this behavior. IA does not recognize duplicate images as distinct; it returns a single image ID that applies to all identical copies, and Wildbook then selects from the related encounters at random. This can also lead to one image being assigned to different encounters/individuals, which causes a 606 error.
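To illustrate the effect (an analogy only, not IA’s actual mechanism): a system that keys images by their content rather than their file name collapses all byte-identical copies into a single record. A minimal Python sketch of the idea, with hypothetical paths:

```python
import hashlib

def image_digest(path):
    # Fingerprint the file contents; the file name plays no part,
    # so a renamed copy of the same photo yields the same digest.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical example: a raw tourist pic and its renamed ID-kit copy
# produce the same digest, i.e. "one image" to a content-keyed system.
# image_digest("sighting_042/IMG_1234.jpg") == image_digest("id_kits/ind_007_L.jpg")
```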
To clarify, and trying not to sound too defensive: this wasn’t a deliberate process. We were unaware that duplicate images with different names were retained in the source data this way when we uploaded both the ID’d dataset and the census data. Since the copies have different file names, there was, and is, no way for us to find and remove these duplicates by name prior to upload.
Next, it’s not clear to us that the system doesn’t support this: it identifies the duplicate for us and tags it in the “alternate reference” box in the match results, and it clearly allows duplicates in uploads. Both of these could imply that the system does support it to some degree.
So we were unaware of the impact of this scenario on matching. To confirm: are you saying that this duplicate-image scenario is the source of all of the issues I listed here (except for what’s in yesterday’s ticket, now tracked under WB-1154)?
Over the course of setting up your Wildbook, we mentioned some of the more common issues that can crop up, including duplicate images affecting match results. For example, we discussed how bulk import allows duplicate images (the importer has no context of the image content; IA does), and how flags like “alternate reference” act as signposts that something is off. That feedback led to the development of the administrative tools that help resolve issues like 606 errors by letting admins actively seek out data problems.
That being said, I will make a point of updating the documentation to reflect this information, and I will make a note to underscore the importance of this issue for others we onboard.
What I can say with certainty is that this is a major problem and is the cause of the match-result confusion. It could also lead to additional issues, such as viewpoint misassignment, because errors cascade in a machine learning platform.
A little addendum about finding and removing duplicates:
When you see one of these alternate references, choose which of the two encounters you wish to keep and delete the other so it doesn’t come up again.
As an additional step, visit the link to the import page (found at the very bottom of the Metadata section) for the encounter you are going to delete. An import that created one duplicate is likely responsible for others, so it is a good starting point for finding and removing them. If the import consists entirely of duplicates, you may be able to save time by using the new import deletion button to remove them all.
If you aren’t sure whether one of the encounters in an import is a duplicate, run a matching job. If it is a duplicate, you will see an ‘Alternate Reference’ in iaResults, just as with the original encounter, and you can safely delete it.
Thanks for clarifying. Something that would have been helpful to us, and will be to other new users, is a better understanding from the outset of the issues that users can cause and the severity of their impact. We received a lot of information and advice about what to do and what not to do well before we had any decent understanding of how the system worked. We’re still learning.
In the case of “alternate reference”, when I specifically asked what this was in a support ticket filed two months ago, the response was that “this is just to make you aware that the annotation exists elsewhere in the system to help you avoid duplication and aid data curation.” I was not told that it would break matching. Had I known that then, we would not have let the researcher get so deep into her matching process without correcting these.
I also feel that simply being told (although we weren’t) that duplicates cause issues with matching is too vague; it lacks the specificity that would drive urgent action, particularly for new users of the system.
With the new data integrity tool, knowing that an annotation has been assigned to two different individuals is interesting and helpful but, for our WD researchers, not an urgent concern, because from their perspective all of these individuals exist as distinct individuals; the problem is simply that, in each instance, an image has been assigned to one of them incorrectly. But I’m now guessing the impact is far more severe than that from an ML perspective, although I have no basis for that assumption other than extrapolating from this scenario.
As to cleanup: unfortunately, Colin, your recommended approach won’t work for this particular dataset and, I suspect, others. There is no single import that created the duplicates. ID kits were built from what the researcher decided were the best L & R viewpoints she could find in a dataset of thousands of tourist images. So finding these duplicates would mean determining which images in the system are the original ID kit images for each individual, then opening each of the thousands of tourist images to find the matches. Even doing it via matching in the system is a massive job; the first 12 best matches do not always surface the “alternate reference” images. But obviously, we’ll need to figure out a way.
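One avenue we could try, assuming the source folders are still available locally and the ID-kit copies are byte-identical to the originals (this would miss re-encoded or cropped copies): hash every file’s contents and group by digest, so renamed duplicates surface regardless of file name. A rough Python sketch, with hypothetical folder names:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(*roots):
    # Group image files by a hash of their raw bytes; any group with
    # more than one path is a set of identical images under different names.
    groups = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
                groups[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

# Hypothetical folder names:
for digest, paths in find_duplicates("id_kits", "tourist_images").items():
    print(digest[:12], *paths)
```

Each group would tell us which ID-kit image corresponds to which raw tourist pic, and we could then reconcile those against the encounters in the system before deciding which to keep.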
Meanwhile, I’d recommend distinguishing showstopper issues from minor inconveniences more clearly, or even just providing a do’s-and-don’ts list for new users.