-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use admin IDs to resolve entities during import, add validation for conflicts in entity maps #777
Conversation
…ame values in person and units
…ext-fixtures Update test fixtures to handle differences between UUIDs and entity name values in person and units
@@ -171,6 +171,7 @@ def related_entities(self): | |||
|
|||
|
|||
@versioned | |||
@sourced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These attributes are, in fact, all sourced. (Violation records come with a single source that the importer associates with each data point.) The complex field module expects sourced data points to be declared sourced and will not return any sources – even if they exist – if data points are not marked as such.
This change makes sourcing more explicit and allows me to use get_sources
in the new test confirming all data points have evidence. Unfortunately, the source creation form behaves differently in that it only expects and creates a source for description, so this change breaks those tests (and the UI).
Per @tlongers, this UI is no longer in use. I have marked the failing tests for skipping, and will open a separate issue to clean up the unused code.
'CompositionChild', | ||
'CompositionParent', | ||
'EmplacementOrganization', | ||
'EmplacementSite', | ||
'EmplacementStartDate', | ||
'MembershipOrganizationMember', | ||
'MembershipOrganizationOrganization', | ||
'MembershipPersonMember', | ||
'MembershipPersonOrganization', | ||
'OrganizationName', | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these sets changed – I just put each item on its own line and sorted them alphabetically, for easier reading.
access_points_for_test = Q() | ||
|
||
for substring in ('alpha', 'beta', 'gamma', 'delta', 'is-member', 'has-member'): | ||
access_points_for_test |= Q(source__title__icontains=substring) | ||
|
||
for access_point in AccessPoint.objects.filter(access_points_for_test): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test data does not conform with the pattern of source creation we used when we wrote the test. Explicitly query for the sources created for the purpose of this test, so new test data may be added without breaking this test.
@@ -30,7 +32,10 @@ def data_import(location_data_import, data_folder): | |||
|
|||
@pytest.mark.django_db | |||
def test_no_sources_missing(data_import): | |||
assert 'does not have sources' not in data_import.getvalue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a lot of ways of reporting the same error. These changes add additional log messages to check for. I'll open a separate issue to unify our logging language.
src_related_attrs = [attr for attr in dir(AccessPoint.objects.first()) | ||
if attr.endswith('_related')] | ||
for access_point in AccessPoint.objects.all(): | ||
def test_all_data_points_have_sources(data_import): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New test.
|
||
|
||
@pytest.mark.django_db | ||
def test_sources_only_created_for_data_points_they_evidence(data_import, data_folder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing test, renamed, documented, and reformatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice cleanup
KEY_TYPE = 'UUID' | ||
VALUE_TYPE = 'name' | ||
|
||
def add(self, key, value, column, row, sheet): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this might be a bit easier to reason about if you implemented this as a nested dict
__init__(self):
_d = {}
...
if key not in self:
self._d[key] = {}
location = (column, row, sheet)
if value not in self[key]:
self_d[key][value] = [location]
else:
self._d[key][value].append(location)
return self
if you need the data back in a particular way you are getting it now you could
implement this dunder method
def __getitem__(self, key):
return [(k,) + values for k, values in self._d[key].items()]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm finding this suggestion much harder to parse. I originally implemented the map as an object with a dictionary attribute here, e182853, but I found that most times I accessed self
, I was accessing self.map
(the dictionary), so it felt more direct for EntityMap
to be a dictionary itself. Is this a change you require?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you effectively do have a nested structure since the logic, as, I understand it is to have
key
|_ value
|_ location
which you sometimes want to transpose to another nested structure:
value
|_ key
|_ location
i think it's going to be easier to work with this in the future if your data structure reflects that nesting.
so, i do want to see an implementation that has that nesting.
I do not require that you implement this with a dictionary attribute. that's only necessary if you need to flattened results back as in the example __getitem__
if you don't need that flattening, you could do.
if key not in self:
self[key] = {}
location = (column, row, sheet)
if value not in self[key]:
self[key][value] = [location]
else:
self[key][value].append(location)
return self
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying! I missed the substance of your first comment. 😅 Refactored in ad9d10b.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good. one slight improvement possible.
@@ -90,7 +104,7 @@ def get_conflicts(self, transpose=False): | |||
entity_map = self if not transpose else self.get_transposed() | |||
|
|||
for key, values in entity_map.items(): | |||
if len(set(val for val, *_ in values)) > 1: | |||
if len(set(val for val in values.keys())) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be if len(values) > 1
since values (as keys to a dictionary) are guaranteed to be unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, thank you and donezo.
Overview
This PR:
EntityMap
class to keep track of units and people present in the importConnects external issue.
Description of logging behavior
Organizations are referenced by name and UUID in the units sheet, as well as the persons and events sheets. Similarly, people are referenced by name and UUID in the persons and events sheets.
The way I've implemented logging will gather references to a given org or person across all sheets. If there is more than one distinct name value, it will log an error for all rows referencing that UUID, so [SFM has] an audit trail of how the entity is represented across the import. Hopefully, this makes it easier to spot which record is not like the others and should be fixed.
So, say the units sheet has an organization with the UUID 1 named Organization A, and the persons sheet says someone is posted to an organization with the UUID 1 named Organisation A. You'd get a record in the organization errors that looks like:
And one in the person errors that looks like:
Ditto for more than one distinct UUID for a given name value.
Testing Instructions