Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking progress of Datasets & Entities #649

Closed
17 tasks done
ktuite opened this issue Oct 26, 2022 · 22 comments
Closed
17 tasks done

Tracking progress of Datasets & Entities #649

ktuite opened this issue Oct 26, 2022 · 22 comments
Labels
entities Multiple Encounter workflows

Comments

@ktuite
Copy link
Member

ktuite commented Oct 26, 2022

This issue tracks progress of the initial implementation of Datasets and Entities, including PRs, to-dos, and issues/bugs that we discover along the way. It is patterned off of a frontend issue about a large code migration.

entities branch

PR #576 represents the entities branch, a large initial chunk of work that covers:

  • Parsing XML
    • Parsing Form XML to get dataset information
      • Validation
    • Parsing Submission XML to make entities
      • Validation
  • Dataset management
    • New schema:
      • Dataset
      • Dataset Properties
      • Dataset Property Fields
    • Create/update datasets on form uploads
    • New API endpoints:
      • List datasets
      • View dataset diff (what effect will publishing a draft form have on a dataset)
      • Link/unlink a dataset to a from that consumes a dataset as an attachment
  • Entity management
    • New schema
      • Entity
      • Entity def
    • Worker that looks for approved submissions and tries to make entities out of them
  • Entity/dataset CSV serving
    • Stream out entities as CSV for forms to use as attachments
  • Audit logging
    • Logging dataset create/update events
    • Logging entity create events
    • Exposing dataset and entity actions through API

Things to get into this PR before merging


Critical to do before release:

Likely before release:

  • Revisit entity-insertion pre-check query (here)
  • Figure out new names for Datasets query module functions getByName and getByProjectAndName (here)
  • Revisit Actee provisioning on Dataset creation via SQL (here)

Collecting issues / unexpected behavior / bugs:

  • Making an entity out of a submission that is from a form that isn't actively linked to a dataset (saw this when rolling back and reapplying a migration, may be hard to recreate this bug but could point so some other underlying issue)
  • Bug: re-approving a submission and re-making the entity makes things crash :(

To come back to after release:

@ktuite ktuite linked a pull request Oct 29, 2022 that will close this issue
@ktuite ktuite reopened this Nov 1, 2022
@matthew-white matthew-white removed a link to a pull request Nov 2, 2022
@matthew-white

This comment was marked as resolved.

@matthew-white

This comment was marked as resolved.

@matthew-white matthew-white pinned this issue Nov 2, 2022
@matthew-white

This comment was marked as resolved.

@matthew-white
Copy link
Member

matthew-white commented Nov 3, 2022

Main issue: #666

It looks like it's currently not possible to purge forms if a soft-deleted form has a submission that has created an entity. I tried to write a test along those lines here:

it('should allow a form that has created an entity to be purged', testService((service, container) =>
  service.login('alice', (asAlice) =>
    asAlice.post('/v1/projects/1/forms?publish=true')
      .set('Content-Type', 'application/xml')
      .send(testData.forms.simpleEntity)
      .expect(200)
      .then(() => asAlice.post('/v1/projects/1/forms/simpleEntity/submissions')
        .send(testData.instances.simpleEntity.one)
        .set('Content-Type', 'application/xml')
        .expect(200))
      .then(() => asAlice.patch('/v1/projects/1/forms/simpleEntity/submissions/one')
        .send({ reviewState: 'approved' })
        .expect(200))
      .then(() => exhaust(container))
      .then(() => asAlice.delete('/v1/projects/1/forms/simpleEntity')
        .expect(200))
      .then(() => container.Forms.purge(true)))));

Right now the test fails with:

error: update or delete on table "submission_defs" violates foreign key constraint "entity_defs_submissiondefid_foreign" on table "entity_defs"


Once an entity is made from a submission, it lives its own lifecycle and deleting that submission (or parent form) won't alter the entity. Though at some point we need to handle entity deletion.

For now:

  1. remove FK constraints from entity tables to form and submission tables
  2. change delete actions - on delete set null
  3. look into a tombstone solution like how form purging keeps the form actee id around with minimal details about the deleted form

@matthew-white
Copy link
Member

matthew-white commented Nov 3, 2022

Main issue: #666

Here's another one related to form purging, more of an edge case. If all the forms related to a dataset are purged, then the dataset is no longer listed, even though it might still be used in active forms as a form attachment. When we list datasets, we filter out datasets from form drafts that haven't been published yet. But I think the result of that filtering is also this behavior around purging.

Here's the relevant SQL:

const _getAllByProjectId = (projectId) => sql`
SELECT DISTINCT d.* FROM datasets d
JOIN dataset_form_defs dd ON d.id = dd."datasetId"
JOIN form_defs fd ON fd.id = dd."formDefId"
WHERE fd."publishedAt" IS NOT NULL AND d."projectId" = ${projectId}
`;

Here's a try at a test along these lines:

it('should list a dataset even if related forms have been purged', testService((service, container) =>
  service.login('alice', (asAlice) =>
    asAlice.post('/v1/projects/1/forms?publish=true')
      .set('Content-Type', 'application/xml')
      .send(testData.forms.simpleEntity)
      .expect(200)
      .then(() => asAlice.get('/v1/projects/1/datasets')
        .expect(200)
        .then(({ body }) => {
          body.length.should.equal(1);
        }))
      .then(() => asAlice.delete('/v1/projects/1/forms/simpleEntity')
        .expect(200))
      .then(() => container.Forms.purge(true))
      .then(() => asAlice.get('/v1/projects/1/datasets')
        .expect(200)
        .then(({ body }) => {
          body.length.should.equal(1);
        })))));

To fix this, I think we may need to add a "published" flag to the datasets table. That would allow us to recognize a dataset as having been associated with a published form even if that published form is purged.


Related case:

  • Form draft is created. The form draft creates a dataset.
  • The form draft is abandoned.
  • Right now the dataset will continue to exist in the database.
  • If a new form draft is created that creates entities for this dataset, then the datasets diff endpoint will indicate that the dataset is a new dataset.
  • If the form is deleted entirely, then the dataset will be orphaned.

@matthew-white
Copy link
Member

matthew-white commented Nov 3, 2022

Main issue: #671

I've been taking a look at #576 and noticed these two remaining related TODOs from #633:

//.then(getOrNotFound) // TODO: change dataset query to return option so this can be used

// TODO: right now this returns 500 internal server error
it.skip('should reject if dataset does not exist', testService((service) =>
service.login('alice', (asAlice) =>
asAlice.post('/v1/projects/1/forms?publish=true')
.send(testData.forms.simpleEntity)
.set('Content-Type', 'application/xml')
.expect(200)
.then(() => asAlice.get('/v1/projects/1/datasets/nonexistent/download')
.expect(404)))));

@matthew-white
Copy link
Member

matthew-white commented Nov 3, 2022

Main issue: #676

If the form specifies a dataset name with leading or trailing white space, then validateDatasetName() will trim the name before validating it. However, getDataset() doesn't seem to trim it when it returns it, so the dataset name ends up having white space. I'd expect that either validateDatasetName() would reject such a name or that getDataset() would also trim it. I tried to write a test along those lines:

it('should trim the dataset name', testService((service) =>
  service.login('alice', (asAlice) =>
    asAlice.post('/v1/projects/1/forms?publish=true')
      .set('Content-Type', 'application/xml')
      .send(testData.forms.simpleEntity.replace('people', ' people\n'))
      .expect(200)
      .then(() => asAlice.get('/v1/projects/1/datasets')
        .expect(200)
        .then(({ body }) => {
          body.length.should.equal(1);
          body[0].name.should.equal('people');
        })))));

@matthew-white
Copy link
Member

matthew-white commented Nov 3, 2022

Main issue: #676

Another one about dataset name. I'm thinking that we shouldn't make that case-insensitive. xmlFormId and form attachment name are case-sensitive, so I think it makes sense for dataset name to be so as well. Dataset names can contain any Unicode letter, but I think it's difficult to combine that with case-insensitivity. For example, toLowerCase() doesn't return the correct result in Turkish for İ (dotted I).

@matthew-white

This comment was marked as resolved.

@matthew-white

This comment was marked as resolved.

@issa-tseng

This comment was marked as resolved.

@lognaturel

This comment was marked as resolved.

@matthew-white

This comment was marked as resolved.

@matthew-white

This comment was marked as resolved.

@matthew-white
Copy link
Member

matthew-white commented Nov 9, 2022

Main issue: #671

I'm noticing the following:

  • Upload but don't publish a form definition that would create entities for a new dataset named people.
  • …/datasets does not return people.
  • However, …/datasets/people/download does serve the entities CSV file. I'd expect that endpoint to return a 404 until the form definition is published.

@matthew-white
Copy link
Member

matthew-white commented Nov 9, 2022

Main issue: #671

There seems to be some inconsistency around how we handle "draft Datasets" and form attachments:

  • Upload but don't publish a form definition that would create entities for a new dataset named people.
  • Upload a form definition for a different form that uses a form attachment named people.csv.
  • people.csv is not autolinked to the people dataset, which is what I would expect.
  • However, it looks like you can use the PATCH endpoint to link people.csv to the unpublished people dataset. I'd expect that endpoint to return a 404 until the first form definition is published.

@matthew-white

This comment was marked as resolved.

@matthew-white
Copy link
Member

matthew-white commented Nov 9, 2022

Main issue: #671

Another one about form drafts + datasets. I'm seeing that as soon as you upload a form definition that would add a property to an existing dataset, that property is included in the entities CSV file. Even if you abandon the form draft, the property is still included in the CSV file. I'd expect that a property is added to the CSV file only once the draft is published.


Add a publishedAt timestamp to ds_properties?

@lognaturel
Copy link
Member

lognaturel commented Nov 10, 2022

Main issue: #676

One more thing to add under " Update dataset/entity parsing to reflect latest version of spec": we need to reject form definitions that are written for future spec versions. More here.

We're going to release the first spec as 2022.1.0. The patch version is there to keep track of changes that don't affect compatibility (e.g. wording changes to the spec document, optional additions). So anything that doesn't start with 2022.1 should be rejected.

Ideally the entities-version attribute is only read if it is in the http://www.opendatakit.org/xforms/entities namespace.

@matthew-white

This comment was marked as resolved.

@ktuite
Copy link
Member Author

ktuite commented Nov 22, 2022

Closing this issue because everything mentioned in here is either completed 🎉 or captured in a new issue.

@ktuite ktuite closed this as completed Nov 22, 2022
@lognaturel
Copy link
Member

🎉 🎉 👏 👏 👏 🎉 🎉

@matthew-white matthew-white unpinned this issue Nov 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
entities Multiple Encounter workflows
Projects
None yet
Development

No branches or pull requests

4 participants