Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: add support for reference deletion and enclose mutations in transactions #63

Merged
merged 7 commits into from
Jul 22, 2020

Conversation

wschurman
Copy link
Member

@wschurman wschurman commented Jul 11, 2020

Why

#19

This PR adds the ability to specify relationships between entities for use during deletion. Three modes are supported:

  • INVALIDATE_CACHE: Invalidates the cache for all entities that reference the entity being deleted. This is most useful when the database itself expresses foreign keys and cascading deletes or set nulls and the entity framework just needs to be kept consistent with the state of the database.
  • CASCADE_DELETE: Deletes the entities that reference the entity being deleted. This is very similar to DBMS ON DELETE CASCADE but is done in the Entity framework instead of at the underlying level, and should not be used in combination with database-schema-expressed foreign keys and deletion behavior. This has the nice effect of keeping the cache consistent as well of course.
  • SET_NULL: Sets fields referencing the entity being deleted to null in any entities that reference the entity being deleted.

Next up: See if it is possible to combine this and association loader in a type safe way. I'm doubting it'll be possible to infer types of edges based on this new schema association specification, but we may at least be able to share the edge definitions for chain loading.

Edit: this PR also now creates a transaction if not already in one for all creates, updates, and deletes. This will become very useful for triggers, but is also useful for dependent deletions added in this PR.

How

The main piece of this PR is the processEntityDeletionForInboundEdgesAsync. See docblock for description. Everything else is tests.

Test Plan

Run all new tests.

@wschurman wschurman requested review from ide and quinlanj July 11, 2020 16:56
@wschurman wschurman force-pushed the @wschurman/references branch from 7f8b0fb to 151a883 Compare July 11, 2020 16:56
@wschurman wschurman changed the title [www] Add support for reference deletion feat: add support for reference deletion Jul 13, 2020
Copy link
Member

@quinlanj quinlanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious as to what kind of standards we'd want to have when we make our DB tables in www going forward since we are supporting the CASCADE and SET NULL behavior at the Entity level now. Should we always be preferring cascades/set nulls be done at the Entity level and migrate our existing tables to conform to this? Or are there cases where it makes sense to have these done at the DB level?

packages/entity/src/EntityMutator.ts Outdated Show resolved Hide resolved
packages/entity/src/EntityFields.ts Outdated Show resolved Hide resolved
@ide
Copy link
Member

ide commented Jul 13, 2020

CASCADE_DELETE... should not be used in combination with database-schema-expressed foreign keys and deletion behavior.

Reading the code, processEntityDeletionForInboundEdgesAsync runs before the target entity is deleted. So even if Postgres declares ON DELETE CASCADE, we'll delete the child nodes before the parent and simply never make use of Postgres's trigger.


There seems to be a potential bug with INVALIDATE_CACHE: throughout the entities library, we mutate the database first (for updates and deletes) and clear the cache second. But here, INVALIDATE_CACHE causes processEntityDeletionForInboundEdgesAsync to clear the cache (for child entities) first and then mutate the database (for the parent entity) second. It seems to me the more correct behavior would be to perform the DB mutations first and then clear the cache entries afterward.

Edit: I realized that we need to run processEntityDeletionForInboundEdgesAsync before deleting the parent node, because if Postgres automatically deletes the child nodes (or nulls out their FK fields), we won't be able to query Postgres for all of the nodes that reference the parent node...


I agree with the sentiment in Quin's comment about developing some guidance around which policy to use, especially when designing a schema from scratch. I see two dimensions here: should the deletion constraint triggers be the responsibility of the entities or the database? and should a deletion trigger's behavior be to delete the child node (CASCADE DELETE), to null out the reference from the child to parent (SET NULL), or to throw an error because the scenario requires extra application code to properly delete the child node before the parent?

(An example of the latter case: we store the content of some Snacks in Google Cloud Storage. When deleting a Snack, we'd want to run some code to delete the GCS file too, or at least schedule a task to clean up GCS. Postgres can't do this automatically and it's not clear Entities should take care of this either. It's behavior we want in application code. In this case we want the default behavior of ON DELETE RESTRICT to cause the deletion of a parent node to fail if the Snack nodes haven't been deleted.)

Copy link
Member

@ide ide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for some small comments and most notably, the order of operations between clearing the database and clearing the cache for the INVALIDATE_CACHE policy.

packages/entity/src/EntityMutator.ts Outdated Show resolved Hide resolved
packages/entity/src/EntityFields.ts Outdated Show resolved Hide resolved
packages/entity/src/EntityMutator.ts Outdated Show resolved Hide resolved
@ide
Copy link
Member

ide commented Jul 13, 2020

My sense is that when people create new entity relationships (either by adding a new entity type or adding a new edge), they start by modifying the Postgres schema. It's at this time they're thinking most about how deletion triggers should work. So, I think it's more natural to express ON DELETE CASCADE/SET NULL/RESTRICT (the default) in the Postgres schema and let entities take care of clearing the cache.

However, I see two flaws with this approach in contrast with having entities perform the deletion triggers:

(1) The deletion trigger policy for Postgres is dislocated from the deletion trigger policy for the cache. It is easy to declare ON DELETE CASCADE in Postgres and then forget to define an association between the entities in a follow-up PR.

(2) Generally speaking, bypassing the entities library to do a mutation is a source of problems because the entities library needs to know about all mutations for cache consistency. Postgres is bypassing the entities library when it processes ON DELETE CASCADE/SET NULL. (In this specific scenario it can be safe since the entities library knows that Postgres is going to perform those mutations...)

@wschurman
Copy link
Member Author

wschurman commented Jul 14, 2020

There seems to be a potential bug with INVALIDATE_CACHE: throughout the entities library, we mutate the database first (for updates and deletes) and clear the cache second. But here, INVALIDATE_CACHE causes processEntityDeletionForInboundEdgesAsync to clear the cache (for child entities) first and then mutate the database (for the parent entity) second. It seems to me the more correct behavior would be to perform the DB mutations first and then clear the cache entries afterward.

Edit: I realized that we need to run processEntityDeletionForInboundEdgesAsync before deleting the parent node, because if Postgres automatically deletes the child nodes (or nulls out their FK fields), we won't be able to query Postgres for all of the nodes that reference the parent node...

Yeah this is something I debated quite a bit while writing it. I had two choices for INVALIDATE_CACHE:

  • Do the deletion in three steps: gather all the entities to be invalidated in one step, run the delete of the parent entity (thus deleting the underlying dependent data from the database), invalidate all the gathered entities.
  • Keep the code consistent between all deletion behavior modes (keep it all in the same switch). The only downside to this is that we could unnecessarily clear the cache if the deletion fails for some reason, but it won't cause any inconsistencies.
    It's really hard to know what the best tradeoff is: keeping the code simple-ish or keeping invalidations consistent.

I'll try to change it and see if I can do it cleanly.

I agree with the sentiment in Quin's comment about developing some guidance around which policy to use, especially when designing a schema from scratch. I see two dimensions here: should the deletion constraint triggers be the responsibility of the entities or the database? and should a deletion trigger's behavior be to delete the child node (CASCADE DELETE), to null out the reference from the child to parent (SET NULL), or to throw an error because the scenario requires extra application code to properly delete the child node before the parent?

I'm pretty unsure on what guidance to provide as well. I think it depends on the type of database being used. For a nosql database (which theoretically entity could support) it would make more sense to do it in the entity layer. For relational DBs I really like specifying foreign keys but I know that at certain scale foreign keys become a huge pain point. The rails convention is for relational DBs to express relationships only in application code (only recently did they start to support real foreign keys).

My gut says that people will have their opinions, and the framework should just support all of them equally. Concretely for us in www my gut says keep using real foreign keys.

(An example of the latter case: we store the content of some Snacks in Google Cloud Storage. When deleting a Snack, we'd want to run some code to delete the GCS file too, or at least schedule a task to clean up GCS. Postgres can't do this automatically and it's not clear Entities should take care of this either. It's behavior we want in application code. In this case we want the default behavior of ON DELETE RESTRICT to cause the deletion of a parent node to fail if the Snack nodes haven't been deleted.)

Non-relational triggers are something that will be needed pretty soon for audit logging. This almost makes me think that we may need to rethink ordering of operations during delete and whether we need to topologically sort trigger, db deletion, and cache execution for all dependent entities.

I'll spend some time thinking about ordering and triggers.

@wschurman
Copy link
Member Author

One interesting thing about rails: both the save and destroy methods are automatically wrapped in a transaction if one isn't provided. I'm starting to think we should do the same here so that we can do better ordering.

@ide
Copy link
Member

ide commented Jul 15, 2020

Do the deletion in three steps: gather all the entities to be invalidated in one step, run the delete of the parent entity (thus deleting the underlying dependent data from the database), invalidate all the gathered entities.

What seems trickiest about this is that the third step is to invalidate the cache for entities that no longer exist in the database. They are kind of like phantom entities. If we made it easy to work with and invalidate these phantom entities, it might be easy to implement this three-phase deletion.

both the save and destroy methods are automatically wrapped in a transaction if one isn't provided. I'm starting to think we should do the same here so that we can do better ordering.

This sounds like a good idea.

@codecov
Copy link

codecov bot commented Jul 15, 2020

Codecov Report

Merging #63 into master will increase coverage by 0.34%.
The diff coverage is 96.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
+ Coverage   94.07%   94.42%   +0.34%     
==========================================
  Files          58       58              
  Lines        1384     1452      +68     
  Branches      147      159      +12     
==========================================
+ Hits         1302     1371      +69     
+ Misses         81       79       -2     
- Partials        1        2       +1     
Flag Coverage Δ
#integration 94.42% <96.33%> (+0.34%) ⬆️
#unittest 94.42% <96.33%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
packages/entity/src/EntityCompanion.ts 100.00% <ø> (ø)
packages/entity/src/EntityDatabaseAdapter.ts 100.00% <ø> (ø)
packages/entity/src/internal/EntityDataManager.ts 100.00% <ø> (ø)
packages/entity/src/EntityQueryContext.ts 94.44% <91.66%> (-5.56%) ⬇️
packages/entity/src/EntityMutator.ts 97.52% <94.91%> (-2.48%) ⬇️
...ter-knex/src/PostgresEntityQueryContextProvider.ts 100.00% <100.00%> (ø)
...ample/src/adapters/InMemoryQueryContextProvider.ts 100.00% <100.00%> (+33.33%) ⬆️
packages/entity/src/Entity.ts 100.00% <100.00%> (+12.00%) ⬆️
packages/entity/src/EntityAssociationLoader.ts 100.00% <100.00%> (ø)
packages/entity/src/EntityConfiguration.ts 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c92d10...f0a1219. Read the comment docs.

@wschurman
Copy link
Member Author

Do the deletion in three steps: gather all the entities to be invalidated in one step, run the delete of the parent entity (thus deleting the underlying dependent data from the database), invalidate all the gathered entities.

What seems trickiest about this is that the third step is to invalidate the cache for entities that no longer exist in the database. They are kind of like phantom entities. If we made it easy to work with and invalidate these phantom entities, it might be easy to implement this three-phase deletion.

After trying this, I'm even more convinced that keeping the code simpler in exchange for potentially erroneous cache invalidations is the right tradeoff. Worst case it would invalidate the cache of an entity who's deletion fails and the next time it is loaded it will be loaded from the database. No inconsistencies would arise.

both the save and destroy methods are automatically wrapped in a transaction if one isn't provided. I'm starting to think we should do the same here so that we can do better ordering.

This sounds like a good idea.

Updated the PR to add this.

@wschurman wschurman requested review from quinlanj and ide July 15, 2020 04:23
@wschurman wschurman changed the title feat: add support for reference deletion feat!: add support for reference deletion Jul 15, 2020
@wschurman wschurman changed the title feat!: add support for reference deletion feat!: add support for reference deletion and enclose mutations in transactions Jul 15, 2020
Copy link
Member

@quinlanj quinlanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On transactions:

  • Agree this is a really good idea to require edge deletions be done in a transaction. At the Postgres layer, the default behaviour is RESTRICT, so it'd enforce fkey integrity by erroring out if any of the edge deletions failed and we tried to delete the main entity. But in the case where its a No-SQL DB, there's nothing at the DB layer to enforce these constraints so a transaction would protect against this.

On www schema guidance:

  • Thinking about this some more, it makes sense to make the policy at the DB level stricter than what is enforced at the Entity level. In the Postgres case, it would be going with the default RESTRICT policy.
  • In the case where there is a bug in the Entity framework:
    - If the DB policy is always stricter than what is required by the Entity framework (RESTRICT), then a bad Entity deletion attempt will just result in an error at the DB level. The worst case here is that the Entities are deleted from cache but all the data is still consistent. The DB will act as a failsafe to enforce fkey integrity.
    - If the DB policy is equivalent or looser than what is required by the Entity framework (ON CASCADE or a noSQL db that doesnt enforce fkey integrity), then you start getting cases of inconsistent data. FKEY integrity would get compromised in the noSQL case, and in the ON CASCADE case, the cache could be inconsistent with the DB because the DB could have done the ON CASCADE work and the Entities framework would not have known to invalidate the cache

@ide
Copy link
Member

ide commented Jul 15, 2020

make the policy at the DB level stricter than what is enforced at the Entity level.

This sounds like good guidance. We can always change RESTRICT to CASCADE if the entities properly declare the association between two entity types.

@wschurman one thing that came to mind that might be good to test is transitively cascading deletions. Say entity A references B which references C. When C is deleted, we delete B, which should delete A. I think the current code neatly takes care of this recursively but it's something to think about. (Also, do we need some kind of cycle detection in the assocs?)

@wschurman
Copy link
Member Author

On www schema guidance:

  • Thinking about this some more, it makes sense to make the policy at the DB level stricter than what is enforced at the Entity level. In the Postgres case, it would be going with the default RESTRICT policy.
  • In the case where there is a bug in the Entity framework:
    • If the DB policy is always stricter than what is required by the Entity framework (RESTRICT), then a bad Entity deletion attempt will just result in an error at the DB level. The worst case here is that the Entities are deleted from cache but all the data is still consistent. The DB will act as a failsafe to enforce fkey integrity.
    • If the DB policy is equivalent or looser than what is required by the Entity framework (ON CASCADE or a noSQL db that doesnt enforce fkey integrity), then you start getting cases of inconsistent data. FKEY integrity would get compromised in the noSQL case, and in the ON CASCADE case, the cache could be inconsistent with the DB because the DB could have done the ON CASCADE work and the Entities framework would not have known to invalidate the cache

Agreed. I think eventually the framework will reach a point when it's essentially guaranteed to be have full coverage and stability for referential integrity and we'll be able to recommend either (like rails/activerecord does), but until then I agree that this sounds like good advice.

one thing that came to mind that might be good to test is transitively cascading deletions. Say entity A references B which references C. When C is deleted, we delete B, which should delete A. I think the current code neatly takes care of this recursively but it's something to think about. (Also, do we need some kind of cycle detection in the assocs?)

The current code definitely handles this but I agree that we should have a test for it. As for cycle detection, that's definitely something that needs a test. I can think of two types of cycles:

  • A owns a B. A owns a C. B owns a C. A is deleted, and therefore B and C should be deleted.
  • A owns a B. B owns a different A via a different field of A. Or something like this.

I'll add some tests for these cases and a simple transitive one as well.

packages/entity/src/EntityFields.ts Outdated Show resolved Hide resolved
packages/entity/src/EntityFields.ts Outdated Show resolved Hide resolved
packages/entity/src/EntityFields.ts Outdated Show resolved Hide resolved
@wschurman
Copy link
Member Author

Updated the PR to support cyclical references, and test in database adapter and unit tests.

Note that keeping the recursive base case (check to ensure the entity hasn't already been processed before) required slight changes to existing behavior where a database adapter would throw if it didn't delete anything, but thinking about that behavior more I think it's fine and probably correct since in a distributed system a single ID could be deleted by multiple processes at the same time. An alternative here would be to additionally check that the entity hasn't been processed before in deleteInternalAsync itself so that the delete call to the database adapter never occurs. Might be worth pairing on this briefly tomorrow to compare the two options and share ideas.

@wschurman wschurman requested a review from ide July 22, 2020 03:48
packages/entity/src/EntityMutator.ts Outdated Show resolved Hide resolved
@wschurman wschurman merged commit 1224de7 into master Jul 22, 2020
@wschurman wschurman deleted the @wschurman/references branch July 22, 2020 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants