Use admin IDs to resolve entities during import, add validation for conflicts in entity maps #777

hancush · 2021-09-15T14:40:20Z

Overview

This PR:

Adds a new EntityMap class to keep track of units and people present in the import
Logs when UUIDs have more than one associated name and when names have more than one associated UUID (although the latter is not always an error, the SFM team would appreciate the warnings, because sometimes more than one UUID is minted for the same entity by mistake)
Updates the import throughout to resolve entities based on their UUID, rather than their name. This is necessary because the current approach of using a name will collapse distinct entities sharing the same name under one UUID, which has serious implications for data integrity.
Includes test data updates c/o @tlongers in Update test fixtures to handle differences between UUIDs and entity name values in person and units #778
Documents and clarifies the existing sourcing test and adds a new sourcing test validating that all data points have evidence
Adds tests for the new resolution and logging behavior

Connects external issue.

Description of logging behavior

Organizations are referenced by name and UUID in the units sheet, as well as the persons and events sheets. Similarly, people are referenced by name and UUID in the persons and events sheets.

The logging I've implemented gathers references to a given org or person across all sheets. If a UUID has more than one distinct name value, it logs an error for every row referencing that UUID, so [SFM has] an audit trail of how the entity is represented across the import. Hopefully, this makes it easier to spot which record is not like the others and should be fixed.

So, say the units sheet has an organization with the UUID 1 named Organization A, and the persons sheet says someone is posted to an organization with the UUID 1 named Organisation A. You'd get a record in the organization errors that looks like:

${LINE_NUMBER},"Got multiple name values for organization UUID ""1"". Current row contains value ""Organization A"" in column ""unit:name"""

And one in the person errors that looks like:

${LINE_NUMBER},"Got multiple name values for organization UUID ""1"". Current row contains value ""Organisation A"" in column ""person:posting_unit_name"""

Ditto for more than one distinct UUID for a given name value.
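To make the behavior described above concrete, here is a minimal sketch. It is not the PR's implementation: the `references` list, the `name_conflict_errors` helper, and the row numbers are hypothetical, and only the message wording and column names are taken from the examples above. (The doubled quotes in those examples come from CSV escaping, so the sketch uses plain quotes.)

```python
from collections import defaultdict

# Hypothetical references gathered across sheets:
# (uuid, name, column, row, sheet)
references = [
    ("1", "Organization A", "unit:name", 2, "units"),
    ("1", "Organisation A", "person:posting_unit_name", 5, "persons"),
    ("2", "Organization B", "unit:name", 3, "units"),
]


def name_conflict_errors(references, entity_type="organization"):
    """Return (sheet, row, message) for every row that references a UUID
    with more than one distinct name value."""
    by_uuid = defaultdict(list)
    for uuid, name, column, row, sheet in references:
        by_uuid[uuid].append((name, column, row, sheet))

    errors = []
    for uuid, refs in by_uuid.items():
        # A conflict exists when the UUID maps to more than one distinct name.
        if len({name for name, *_ in refs}) > 1:
            # Report every row for that UUID, not just the odd one out.
            for name, column, row, sheet in refs:
                message = (
                    f'Got multiple name values for {entity_type} UUID "{uuid}". '
                    f'Current row contains value "{name}" in column "{column}"'
                )
                errors.append((sheet, row, message))
    return errors


for sheet, row, message in name_conflict_errors(references):
    print(sheet, row, message)
```

Reporting every row for a conflicting UUID, rather than only the second spelling encountered, is what gives the audit trail: the importer cannot know which spelling is the correct one.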

Testing Instructions

Confirm CI passes

hancush · 2021-09-21T14:06:32Z

violation/models.py

@@ -171,6 +171,7 @@ def related_entities(self):


 @versioned
+@sourced


These attributes are, in fact, all sourced. (Violation records come with a single source that the importer associates with each data point.) The complex field module expects sourced data points to be declared sourced and will not return any sources – even if they exist – if data points are not marked as such.

This change makes sourcing more explicit and allows me to use get_sources in the new test confirming all data points have evidence. Unfortunately, the source creation form behaves differently in that it only expects and creates a source for description, so this change breaks those tests (and the UI).

Per @tlongers, this UI is no longer in use. I have marked the failing tests for skipping, and will open a separate issue to clean up the unused code.

hancush · 2021-09-21T14:10:17Z

tests/test_importer.py

+               'CompositionChild',
+               'CompositionParent',
+               'EmplacementOrganization',
+               'EmplacementSite',
+               'EmplacementStartDate',
+               'MembershipOrganizationMember',
+               'MembershipOrganizationOrganization',
+               'MembershipPersonMember',
+               'MembershipPersonOrganization',
+               'OrganizationName',
+            ])


None of these sets changed – I just put each item on its own line and sorted them alphabetically, for easier reading.

hancush · 2021-09-21T14:11:23Z

tests/test_importer.py

+    access_points_for_test = Q()
+
+    for substring in ('alpha', 'beta', 'gamma', 'delta', 'is-member', 'has-member'):
+        access_points_for_test |= Q(source__title__icontains=substring)
+
+    for access_point in AccessPoint.objects.filter(access_points_for_test):


The new test data does not conform to the source-creation pattern we used when we wrote the test. This change explicitly queries for the sources created for this test, so new test data can be added without breaking it.

hancush · 2021-09-21T14:15:43Z

tests/test_importer.py

@@ -30,7 +32,10 @@ def data_import(location_data_import, data_folder):

 @pytest.mark.django_db
 def test_no_sources_missing(data_import):
+    assert 'does not have sources' not in data_import.getvalue()


We have a lot of ways of reporting the same error. These changes add additional log messages to check for. I'll open a separate issue to unify our logging language.

hancush · 2021-09-22T20:15:27Z

tests/test_importer.py

-    src_related_attrs = [attr for attr in dir(AccessPoint.objects.first())
-                         if attr.endswith('_related')]
-    for access_point in AccessPoint.objects.all():
+def test_all_data_points_have_sources(data_import):


hancush · 2021-09-22T20:15:46Z

tests/test_importer.py

+
+
+@pytest.mark.django_db
+def test_sources_only_created_for_data_points_they_evidence(data_import, data_folder):


Existing test, renamed, documented, and reformatted.

fgregg

Very nice cleanup

fgregg · 2021-09-27T13:43:50Z

sfm_pc/management/commands/import_google_doc.py

+    KEY_TYPE = 'UUID'
+    VALUE_TYPE = 'name'
+
+    def add(self, key, value, column, row, sheet):


Seems like this might be a bit easier to reason about if you implemented this as a nested dict

    def __init__(self):
        self._d = {}

    ...

    if key not in self._d:
        self._d[key] = {}

    location = (column, row, sheet)

    if value not in self._d[key]:
        self._d[key][value] = [location]
    else:
        self._d[key][value].append(location)

    return self

if you need the data back in the flattened form you are getting now, you could implement this dunder method:

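    def __getitem__(self, key):
        # Flatten value -> [locations] into (value, column, row, sheet) tuples
        return [(k,) + location
                for k, locations in self._d[key].items()
                for location in locations]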

Hmm, I'm finding this suggestion much harder to parse. I originally implemented the map as an object with a dictionary attribute here, e182853, but I found that most times I accessed self, I was accessing self.map (the dictionary), so it felt more direct for EntityMap to be a dictionary itself. Is this a change you require?

i think you effectively do have a nested structure, since the logic, as I understand it, is to have

key
|_ value
   |_ location

which you sometimes want to transpose to another nested structure:

value
|_ key
   |_ location

i think it's going to be easier to work with this in the future if your data structure reflects that nesting.

so, i do want to see an implementation that has that nesting.

I do not require that you implement this with a dictionary attribute. That's only necessary if you need the flattened results back, as in the example __getitem__. If you don't need that flattening, you could do:

    if key not in self:
        self[key] = {}

    location = (column, row, sheet)

    if value not in self[key]:
        self[key][value] = [location]
    else:
        self[key][value].append(location)

    return self

Thanks for clarifying! I missed the substance of your first comment. 😅 Refactored in ad9d10b.
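For readers following the thread, the nested structure under discussion can be sketched roughly as below. This is an illustration only, not the code from ad9d10b: the `add` signature and the `get_transposed` name appear in the diff, but the method bodies, the use of `setdefault` in place of the explicit `if`/`else`, and the sample rows are assumptions.

```python
class EntityMap(dict):
    """Hypothetical nested map: key -> value -> list of locations."""

    def add(self, key, value, column, row, sheet):
        location = (column, row, sheet)
        # Create the inner dict and the location list on first sight.
        self.setdefault(key, {}).setdefault(value, []).append(location)
        return self

    def get_transposed(self):
        """Return a new map nested as value -> key -> list of locations."""
        transposed = EntityMap()
        for key, values in self.items():
            for value, locations in values.items():
                for location in locations:
                    transposed.add(value, key, *location)
        return transposed


org_map = EntityMap()
org_map.add("1", "Organization A", "unit:name", 2, "units")
org_map.add("1", "Organisation A", "person:posting_unit_name", 5, "persons")

print(org_map["1"])
print(org_map.get_transposed()["Organisation A"])
```

With this shape, the UUID-to-name view and the name-to-UUID view are the same structure with the two outer levels swapped, so one conflict check can serve both.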

sfm_pc/management/commands/import_google_doc.py

fgregg

looking good. one slight improvement possible.

fgregg · 2021-09-27T14:46:54Z

sfm_pc/management/commands/import_google_doc.py

@@ -90,7 +104,7 @@ def get_conflicts(self, transpose=False):
        entity_map = self if not transpose else self.get_transposed()

        for key, values in entity_map.items():
-            if len(set(val for val, *_ in values)) > 1:
+            if len(set(val for val in values.keys())) > 1:


this can be if len(values) > 1 since values (as keys to a dictionary) are guaranteed to be unique

Nice catch, thank you and donezo.
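A quick way to see why the simplification holds, assuming the inner `values` object is a plain dict keyed by name (the sample data here is hypothetical):

```python
# Hypothetical inner dict for one UUID: name -> list of (column, row, sheet)
values = {
    "Organization A": [("unit:name", 2, "units"), ("unit:name", 9, "units")],
    "Organisation A": [("person:posting_unit_name", 5, "persons")],
}

# Dictionary keys are already unique, so building a set adds nothing.
distinct_via_set = len(set(val for val in values.keys()))
distinct_via_len = len(values)

has_conflict = len(values) > 1
print(distinct_via_set, distinct_via_len, has_conflict)
```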

Use admin IDs provided in sheet instead of entity mapping to resolve …

5493b89

…entities

hancush commented Sep 15, 2021

tlongers and others added 7 commits September 15, 2021 17:11

Update test fixtures to handle differences between UUIDs and entity n…

edb5f8e

…ame values in person and units

Merge pull request #778 from security-force-monitor/update-importer-t…

ad40887

…ext-fixtures Update test fixtures to handle differences between UUIDs and entity name values in person and units

Debug failing source test, add scaffolding for new tests

4a6aea9

Add test for entities sharing a name

f72d610

Add test that all data is sourced, docs for source assignment

d22f52a

Log an error if two records sharing a UUID have different names

b0f23a8

Skip failing violation tests, pending response

d82c30a

hancush commented Sep 21, 2021

Clean up implementation and comments

0cc5707

hancush commented Sep 21, 2021

hancush mentioned this pull request Sep 21, 2021

Patch entity resolution, source validation, membership creation and signal management in data import #779

Merged

hancush changed the base branch from hcg/translation to hcg/import-patches September 21, 2021 15:23

hancush and others added 7 commits September 22, 2021 11:00

Add V1 of EntityMap

e182853

Add V2 of EntityMap

645b231

Add docstrings, break logging into its own method

1e54ea2

Expand logging test to cover cases where name has more than one UUID

9f65c4f

Please run CI

f8ac153

Merge branch 'hcg/import-patches' into hcg/entity-resolution

b2509fd

Format from kwargs

c8b9813

hancush changed the title ~~Use admin IDs provided in sheet instead of entity mapping to resolve entities~~ Use admin IDs to resolve entities during import, add validation for conflicts in entity maps Sep 22, 2021

hancush marked this pull request as ready for review September 22, 2021 20:07

hancush requested a review from fgregg September 22, 2021 20:08

hancush commented Sep 22, 2021

fgregg requested changes Sep 27, 2021

Use nested structure in entity map

ad9d10b

Minor copyedit to docstring

4d0d98b

fgregg requested changes Sep 27, 2021

Tighten up values check

1a13346

fgregg approved these changes Sep 27, 2021

hancush merged commit 372fe67 into hcg/import-patches Sep 27, 2021

hancush deleted the hcg/entity-resolution branch September 27, 2021 20:14
