Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use admin IDs to resolve entities during import, add validation for conflicts in entity maps #777

Merged
merged 19 commits into from
Sep 27, 2021

Conversation

hancush
Copy link

@hancush hancush commented Sep 15, 2021

Overview

This PR:

  • Adds a new EntityMap class to keep track of units and people present in the import
  • Logs when UUIDs have more than one associated name and when names have more than one associated UUID (although the latter is not always an error, the SFM team would appreciate the warnings, because sometimes more than one UUID is minted for the same entity by mistake)
  • Updates the import throughout to resolve entities based on their UUID, rather than their name. This is necessary because the current approach of using a name will collapse distinct entities sharing the same name under one UUID, which has serious implications for data integrity.
  • Includes test data updates c/o @tlongers in Update test fixtures to handle differences between UUIDs and entity name values in person and units #778
  • Documents and clarifies the existing sourcing test and adds a new sourcing test validating that all data points have evidence
  • Adds test for the new resolution and logging behavior

Connects external issue.

Description of logging behavior

Organizations are referenced by name and UUID in the units sheet, as well as the persons and events sheets. Similarly, people are referenced by name and UUID in the persons and events sheets.

The way I've implemented logging will gather references to a given org or person across all sheets. If there is more than one distinct name value, it will log an error for all rows referencing that UUID, so [SFM has] an audit trail of how the entity is represented across the import. Hopefully, this makes it easier to spot which record is not like the others and should be fixed.

So, say the units sheet has an organization with the UUID 1 named Organization A, and the persons sheet says someone is posted to an organization with the UUID 1 named Organisation A. You'd get a record in the organization errors that looks like:

 ${LINE_NUMBER},"Got multiple name values for organization UUID ""1"". Current row contains value ""Organization A"" in column ""unit:name"""

And one in the person errors that looks like:

${LINE_NUMBER},"Got multiple name values for organization UUID ""1"". Current row contains value ""Organisation A"" in column ""person:posting_unit_name"""

Ditto for more than one distinct UUID for a given name value.

Testing Instructions

  • Confirm CI passes

tests/test_importer.py Outdated Show resolved Hide resolved
@@ -171,6 +171,7 @@ def related_entities(self):


@versioned
@sourced
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These attributes are, in fact, all sourced. (Violation records come with a single source that the importer associates with each data point.) The complex field module expects sourced data points to be declared sourced and will not return any sources – even if they exist – if data points are not marked as such.

This change makes sourcing more explicit and allows me to use get_sources in the new test confirming all data points have evidence. Unfortunately, the source creation form behaves differently in that it only expects and creates a source for description, so this change breaks those tests (and the UI).

Per @tlongers, this UI is no longer in use. I have marked the failing tests for skipping, and will open a separate issue to clean up the unused code.

Comment on lines +155 to +165
'CompositionChild',
'CompositionParent',
'EmplacementOrganization',
'EmplacementSite',
'EmplacementStartDate',
'MembershipOrganizationMember',
'MembershipOrganizationOrganization',
'MembershipPersonMember',
'MembershipPersonOrganization',
'OrganizationName',
])
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these sets changed – I just put each item on its own line and sorted them alphabetically, for easier reading.

Comment on lines +128 to +133
access_points_for_test = Q()

for substring in ('alpha', 'beta', 'gamma', 'delta', 'is-member', 'has-member'):
access_points_for_test |= Q(source__title__icontains=substring)

for access_point in AccessPoint.objects.filter(access_points_for_test):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test data does not conform with the pattern of source creation we used when we wrote the test. Explicitly query for the sources created for the purpose of this test, so new test data may be added without breaking this test.

@@ -30,7 +32,10 @@ def data_import(location_data_import, data_folder):

@pytest.mark.django_db
def test_no_sources_missing(data_import):
assert 'does not have sources' not in data_import.getvalue()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a lot of ways of reporting the same error. These changes add additional log messages to check for. I'll open a separate issue to unify our logging language.

@hancush hancush changed the title Use admin IDs provided in sheet instead of entity mapping to resolve entities Use admin IDs to resolve entities during import, add validation for conflicts in entity maps Sep 22, 2021
@hancush hancush marked this pull request as ready for review September 22, 2021 20:07
@hancush hancush requested a review from fgregg September 22, 2021 20:08
src_related_attrs = [attr for attr in dir(AccessPoint.objects.first())
if attr.endswith('_related')]
for access_point in AccessPoint.objects.all():
def test_all_data_points_have_sources(data_import):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New test.



@pytest.mark.django_db
def test_sources_only_created_for_data_points_they_evidence(data_import, data_folder):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing test, renamed, documented, and reformatted.

Copy link
Collaborator

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice cleanup

KEY_TYPE = 'UUID'
VALUE_TYPE = 'name'

def add(self, key, value, column, row, sheet):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this might be a bit easier to reason about if you implemented this as a nested dict

__init__(self):
    _d = {}
...
if key not in self:
    self._d[key] = {}

location = (column, row, sheet)
if value not in self[key]:
    self_d[key][value] = [location]
else:
    self._d[key][value].append(location)

return self

if you need the data back in a particular way you are getting it now you could
implement this dunder method

def __getitem__(self, key):
    return [(k,) + values for k, values in self._d[key].items()]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm finding this suggestion much harder to parse. I originally implemented the map as an object with a dictionary attribute here, e182853, but I found that most times I accessed self, I was accessing self.map (the dictionary), so it felt more direct for EntityMap to be a dictionary itself. Is this a change you require?

Copy link
Collaborator

@fgregg fgregg Sep 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you effectively do have a nested structure since the logic, as, I understand it is to have

key
|_ value
    |_ location

which you sometimes want to transpose to another nested structure:

value
|_ key
    |_ location

i think it's going to be easier to work with this in the future if your data structure reflects that nesting.

so, i do want to see an implementation that has that nesting.

I do not require that you implement this with a dictionary attribute. that's only necessary if you need to flattened results back as in the example __getitem__ if you don't need that flattening, you could do.

if key not in self:
    self[key] = {}

location = (column, row, sheet)
if value not in self[key]:
    self[key][value] = [location]
else:
    self[key][value].append(location)

return self

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying! I missed the substance of your first comment. 😅 Refactored in ad9d10b.

Copy link
Collaborator

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. one slight improvement possible.

@@ -90,7 +104,7 @@ def get_conflicts(self, transpose=False):
entity_map = self if not transpose else self.get_transposed()

for key, values in entity_map.items():
if len(set(val for val, *_ in values)) > 1:
if len(set(val for val in values.keys())) > 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be if len(values) > 1 since values (as keys to a dictionary) are guaranteed to be unique

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thank you and donezo.

@hancush hancush merged commit 372fe67 into hcg/import-patches Sep 27, 2021
@hancush hancush deleted the hcg/entity-resolution branch September 27, 2021 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants