Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User migration via export file #3054

Merged
merged 65 commits into from
Jan 2, 2024
Merged

User migration via export file #3054

merged 65 commits into from
Jan 2, 2024

Conversation

hughrun
Copy link
Contributor

@hughrun hughrun commented Oct 22, 2023

This PR aims to implement #1012. It introduces new export and import functionality, that allows users to export their entire account and all associated media in a self-contained tar file.

Resolves #1012
Resolves #2666

Archive structure

The tar.gz archive contains three parts:

  • archive.json: Containing the user settings, names and books used by the user.
  • avatar.png/avatar.jpg: A user's avatar image if present
  • covers/: the book covers used by the user

Included

  • user profile info
  • basic user settings
  • reading goal
  • books
  • readthroughs
  • shelves
  • book lists
  • reviews
  • comments
  • quotes
  • saved book lists
  • follows (who you follow)
  • blocks (who you block)

Not included

  • user setting relying on changeable server settings (e.g. themes)
  • replies to statuses
  • followers (these are managed in Add Move activity for user migration #2970)
  • Groups (these are currently not federated so can't be migrated)

Also note that for reasons of expediency the JSON export is a custom structure and not in the form of ActivityPub JSON-LD.

Implementation

This project was mostly implemented in collaboration during the first Guild Alpha Sprint.

We decided to use a custom JSON archive as the basis of this proposal, as @hughrun had already done work on that, which we used as a base.

Contributors:

Special thanks to @Ryuno-Ki and @circlebuilder

Guild Alpha added:

  • Improved serialization of the archive.json
  • Move exporting and importing to celery tasks
  • Wrap the archive.json into a tar archive
  • Add media files to the archive
  • Unit Tests of the import and export process

Due to time constraints we couldn't implement everything needed. Subsequently @hughrun has added:

  • Cool-down period between creating archives to prevent overloading the server
  • Notification on import & export completion
  • Error handling & retry mechanism. Currently when an import or export fails the Task status remains active. The status doesn't show "failed" and no error is given.
  • Admin controls to monitor and stop user imports

Related work

This was originally offered as #2980 - it was pulled into a new branch so the work could be completed
This is a sibling PR with #2970

Remaining work - help wanted

  • mypy doesn't like bookwyrm/utils/tar.py and I'm not sure what to do or whether I can just tell mypy to ignore types for now. There's also a related test that is failing
  • there are some pylint errors in test_bookwyrm_import_job which indicate I probably need to refactor a test there but I can't figure out what I should be doing
  • this is a big PR and needs a couple of people to thoroughly review it

Other work - possibly in scope

  • It may make sense to review how this works with Add Move activity for user migration #2970 as we can probably consolidate some menus and definitely will need some clear instructions. Possibly we could pull them both into main but check for potential streamlining before releasing as production-ready.

Future work - not in scope

CSDUMMI and others added 20 commits September 7, 2023 22:37
…etween instances (#38)

Co-authored-by: Daniel Burgess <[email protected]>
Co-authored-by: Hugh Rundle <[email protected]>
Co-authored-by: dannymate <[email protected]>
Co-authored-by: hughrun <[email protected]>
Reviewed-on: https://codeberg.org/GuildAlpha/bookwyrm/pulls/38
Co-authored-by: CSDUMMI <[email protected]>
Co-committed-by: CSDUMMI <[email protected]>
* cleans up some test logging
* cleans up some commented-out code
* adds export_job model tests
* reconsiders some tests in export user view tests
Complete Migrations of Bookwyrm Accounts across instances

Merging this into `user-migration` branch to enable final work on this within the main Bookwyrm repository. We will pull in the final PR from there into `main` when ready.

Thanks to @CSDUMMI and the crew for this huge job.
* fix Safari not downloading with the correct filename
* add FAILED status
* don't provide download link for stopped jobs
add USER_EXPORT_COOLDOWN_HOURS setting for controlling user exports and imports
- makes user_import_time_limit a site setting rather than a value in settings.py (note this applies to exports as well as imports)
- admins can change user_import_time_limit from UI
- admins can cancel stuck user imports
- disabling new imports also disables user imports
complete most outstanding user migrate tasks
once more into the linting breach!
@hughrun hughrun changed the title User migration User migration via export file Oct 22, 2023
@hughrun hughrun marked this pull request as ready for review October 22, 2023 07:49
Copy link
Contributor

@dato dato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I took a look at the mypy/pylint errors you mentioned, please see below.

bookwyrm/utils/tar.py Outdated Show resolved Hide resolved
bookwyrm/utils/tar.py Outdated Show resolved Hide resolved
bookwyrm/utils/tar.py Outdated Show resolved Hide resolved
bookwyrm/tests/models/test_bookwyrm_import_job.py Outdated Show resolved Hide resolved
@hughrun
Copy link
Contributor Author

hughrun commented Oct 23, 2023

Thanks @dato those fixes do indeed seem to resolve most of the problems. And thanks @mouse-reeve for picking up my over-eager linting "fix" with the tuple.

I've sorted these out on my local, I'll probably push them up shortly. (done)

Now that's resolved, I've done another manual test and realised there is something here creating duplicate IDs in the database. I think it's probably not the BookwyrmImport functions, because it's happening every time I do a normal add of a book from another bookwyrm site: just a chaos of duplicate user IDs all over the place. Having said that, we also seem to be getting duplicate statuses on import which is obviously not great. We have reports of duplicates on production sites now, so it's possible this is something that wasn't introduced in this PR but is manifesting more obviously for some reason.

I'll have to dig in and see if I can work out what's going on. Any extra help would be appreciated, for now I'm going to change this PR back to draft.

(edit - I think it was just a borked environment, I tested again after a couple of tweaks and all good).

@hughrun hughrun marked this pull request as draft October 23, 2023 07:43
Mouse Reeve added 2 commits November 15, 2023 16:46
I think the test was failing because it was extremely brittle, not
because of anything wrong with the code itself.
@mouse-reeve
Copy link
Member

I can't get ANY of the tests to run locally. It looks like migration 0187 is referencing 0186_alter_notification_notification_type, which should actually be 0185_alter_notification_notification_type? I tried deleting the migrations and making new ones with no merge migrations at all.

That doesn't fix the failing migration test, but I think that's because that test is really brittle -- it looks like it's failing because it's using the database schema pegged to a specific migration but the current code, which is in a different state. I propose deleting it.

@mouse-reeve
Copy link
Member

On reflection, I wonder if you have a local copy of migration 0186 that was allowing your tests to pass? At any rate, if the commit I just made passes, I think that's good? If not (or if you don't think we should delete that test) I'm happy to nix it.

@hughrun
Copy link
Contributor Author

hughrun commented Nov 16, 2023

Looks like that worked!

I think I messed up a merge somewhere, but also this PR has taken so long with so many different merges it's a bit trickier to keep all the migrations aligned.

I would recommend testing a migrate on a pre-existing DB once this is merged into main, before releasing it to production.

Mouse Reeve added 2 commits December 9, 2023 08:09
Adds a link in the text of the notification, and fixes references to
notification type in the model
I think this wording is a little clearer
@mouse-reeve
Copy link
Member

I made a small change to fix a bug in how notifications were being created, and changed the wording to (hopefully) clarify the difference between CSV import/export and user import/export, since user import/export does include both user and book data:
Screenshot 2023-12-09 at 8 19 26 AM

Otherwise this looks good to me. I'd like to get it merged and do a release early next week 🥳 -- it would also be great to get someone else to test this (@bookwyrm-social/code-review) both for bugs and for clarity.

@dato
Copy link
Contributor

dato commented Dec 11, 2023

Hello! (I apologize that I didn't run these tests much earlier.)

I tested this on my development instance, so I exported data for my main user (id=2), then created an empty user (id=90) and imported into that account.

While books in shelves where imported correctly:

bookwyrm_devel=> SELECT user_id, book_id, shelf_id
  FROM bookwyrm_shelfbook
  WHERE user_id in (2, 90)
  ORDER BY 1, 2, 3;

 user_id | book_id | shelf_id 
---------+---------+----------
       2 |       8 |        6
       2 |       8 |        9
       2 |       9 |        6
      90 |       8 |       19
      90 |       8 |       22
      90 |       9 |       19

something weird (bad?) happened with statuses: it would seem they simply got reassigned to user id 90 (in the web UI, indeed the origin user has no statuses in their Activity tab after the import):

bookwyrm_devel=> SELECT id as status_id, user_id, created_date, updated_date
  FROM bookwyrm_status
  WHERE user_id in (2, 90)
  ORDER BY 3;

 status_id | user_id |         created_date          |         updated_date          
-----------+---------+-------------------------------+-------------------------------
         1 |      90 | 2023-09-24 19:17:48.920904-03 | 2023-12-11 05:04:34.542547-03
         2 |      90 | 2023-09-24 19:51:24.496355-03 | 2023-12-11 05:04:34.949896-03
         3 |      90 | 2023-10-14 22:13:13.460129-03 | 2023-12-11 05:04:34.660894-03
         4 |      90 | 2023-10-14 22:13:35.197837-03 | 2023-12-11 05:04:35.06195-03
         5 |      90 | 2023-11-15 18:52:25.785505-03 | 2023-12-11 05:04:35.16605-03
(5 rows)

Importing into the export instance is arguably a corner case, albeit one I think people might need to make use of for username changes.

I haven't had a chance to look into the code yet, though I suspect the fix might be easier than this write-up!, heh.

Edited to add: I was wondering if this was minor enough that the PR could be merged anyway, but I just realized this means a malicious user could (perhaps?) craft an archive.json that deletes another user’s data.

@hughrun
Copy link
Contributor Author

hughrun commented Dec 11, 2023

@dato thanks for being so thorough.

it would seem they simply got reassigned to user id 90

I hadn't expected this, but it makes sense. In the original way I wrote this it wouldn't have happened, but making export files as close to ActivityPub objects has a lot of advantages so we use to_model() on import.

The issue here is in bookwyrm_import_job:

instance = parsed.to_model(model=cls, save=True, overwrite=True)

The only way I could get this to work is to use overwrite=True, which ...does what you would expect. This is what I was struggling with in hughrun#1 So we swap out the User to avoid assigning the status to, to take your example, the user with an pk of 2 on a new server. But if it's the same server then the effect is to just keep the statuses that already existed but change the user id. This is actually fine if everyone is doing the right thing, but as you point out the import file can edited or created from scratch.

I was wondering if this was minor enough that the PR could be merged anyway, but I just realized this means a malicious user could (perhaps?) craft an archive.json that deletes another user’s data.

I admit this is something I hadn't thought of, and you're right, it would be possible for a malicious user - though you would also need to know the user's primary key, which isn't necessarily obvious, and craft an export file targeted at their post IDs. This PR was originally started before #3076, but now that's merged in we could potentially add a check to insist that the source account has to have movedTo set to the target account. That would however make it impossible to migrate an account after the source server is shut down or if the source account is deleted rather than 'moved', for example. Perhaps that's a price we have to pay to remove a potential vector for malicious activity?

@dato
Copy link
Contributor

dato commented Dec 11, 2023

so we use to_model() on import

I took a closer look because of this comment, and there is a second issue because the new statuses preserve the remote_id field, pointing to the old URLs. Clicking on their permalink (“15 hours ago...”) fails:¹

bookwyrm_devel=> SELECT s.id, username, s.remote_id
  FROM bookwyrm_status s
  JOIN bookwyrm_user u
  ON u.id = s.user_id
  WHERE user_id = 90
  ORDER BY 1;

 id |           username           |                        remote_id                        
----+------------------------------+---------------------------------------------------------
  1 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/comment/1
  2 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/comment/2
  3 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/reviewrating/3
  4 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/review/4
  5 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/review/5
(5 rows)

(¹) It fails because those URLs don't exist any more. But if this happens in cross-instance imports (I'm not sure that it does), it could be considered worse since those permalinks should stay in-instance.’

Edited to add: confirmed it happens when importing into a separate instance:

bookwyrm_test=# SELECT s.id, username, s.remote_id
  FROM bookwyrm_status s
  JOIN bookwyrm_user u
  ON u.id = s.user_id
  WHERE user_id = 1 
  ORDER BY 1;

 id |           username           |                        remote_id                        
----+------------------------------+---------------------------------------------------------
  1 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/comment/1
  2 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/reviewrating/3
  3 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/comment/2
  4 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/review/4
  5 | [email protected] | https://libros.dev.rizoma1.ar/user/admin/review/5
(5 rows)

@hughrun
Copy link
Contributor Author

hughrun commented Dec 11, 2023

dang. I missed that, I think I assumed it would create a new remote_id but logically, why would it? I think I see what needs to happen here, will take a look tonight my time (and also remove a few rogue print statements).

@mouse-reeve
Copy link
Member

mouse-reeve commented Dec 11, 2023

I’m so glad you caught that, it does make sense that the old remote is remains. Every model should have a function to generate the remote id which we can re-run and save

def get_remote_id(self):

@hughrun
Copy link
Contributor Author

hughrun commented Dec 12, 2023

I think I've got a working solution for this. This solution will mean the import still works for everything else, but statuses won't be created if the source account hasn't been set as movedTo the account that we're importing into: i.e. you have to export, then "move" the account, then you import the user data to the new account.

I'm not sure of the best way to present any errors to the user though. If it fails to import statuses that will probably be because they're a legit user who forgot to do things in the correct order, or maybe doesn't have access to the old account any more, so we need to provide clear info about why statuses haven't imported rather than just silently not importing them.

Any suggestions @mouse-reeve ?

@dato
Copy link
Contributor

dato commented Dec 12, 2023

I think I've got a working solution for this.

Great! I have two questions if that's okay:

  1. When statuses are imported, I assume they still need to be special-cased so that their remote_id is made proper. Is this done in a way that new objects are created, instead of reusing the existing ones? (for the case of same-instance import).

  2. To perform a Move, the target account has to advertise the old one as one of its aliases, is that right? For importing statuses, could we require, only, that the source-of-export account advertises the target account as one of its aliases? (i.e. an alias in the opposite direction that the move would require). This seems proof of ownership enough, and wouldn't require performing the move before having a chance to verify the import.

Thanks!!

@hughrun
Copy link
Contributor Author

hughrun commented Dec 12, 2023

@dato

  1. Yes. I momentarily thought maybe I'd done something wrong because I was getting duplicated statuses, but overnight I realised that the only thing that BookWyrm matches on for statuses is the remote_id. When that is changed a new status will always be created. This is a very sensible default usually.

However, that creates a new problem in that we don't always want to do that when importing (e.g. if for some reason a user imports the same file twice). So I'll have to add some custom logic there.

  1. This is a good suggestion and would be easy to do. If the target user is set as alsoKnownAs OR movedTo by the source, it's ok.

- remote_id is now updated on import of statuses
- statuses cannot be imported unless source has target listed in alsoKnownAs or movedTo
- add alert boxes to import and export screens advising of the above
- update tests accordingly
@hughrun
Copy link
Contributor Author

hughrun commented Dec 13, 2023

@dato @bookwyrm-social/code-review please take a look at this now.

Generally, does anything look problematic?

Specifically:

  • does the latest commit resolve the bug @dato identified?
  • does it introduce any new problems?
  • are the messages about setting an alias sufficient?

Copy link
Contributor

@dato dato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • does the latest commit resolve the bug @dato identified?
  • does it introduce any new problems?
  • are the messages about setting an alias sufficient?

I believe the commit is good, yes!

One tiny comment below, but absolutely not a blocker, is a corner case (again).

Thanks!

bookwyrm/models/bookwyrm_import_job.py Show resolved Hide resolved
@mouse-reeve mouse-reeve merged commit ca79cb1 into main Jan 2, 2024
10 checks passed
@mouse-reeve mouse-reeve deleted the user-migration branch January 2, 2024 03:04
@dato
Copy link
Contributor

dato commented Jan 2, 2024

👏🏼👏🏼👏🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants