Deduplication / uniqueness not supported? Merging based on custom logic seems required for cross-device sync. #117
-
Hi @rkhamilton, we have thought about adding a customizable uniqueness conflict handler to the library, so that when we are inserting data from CK into the user's local database and SQLite emits a uniqueness constraint failure, that handler is invoked. But, as it turns out, we feel there are alternative ways to handle the 3 use cases you gave above that do not require any additional infrastructure from the library. I'll explain each of them individually:
If you were to model your tags table like so:

CREATE TABLE "tags" (
  "title" TEXT PRIMARY KEY NOT NULL,
  …
)

…then you would get conflict resolution for free from our library. When synchronizing, we always upsert data based on primary key (which is the title of the tag), and conflicting records are merged using a last-edit-wins strategy on a per-field basis. You can even add "COLLATE NOCASE" if you want case-insensitivity in the uniqueness.

And so if an iPhone created a photo with tag "Family" and an iPad created a different photo with tag "Family", then when the devices synchronize it should work just fine. Both new photos will be properly tagged with the unique "Family" tag.

Now, using something like the tag title as a primary key does come with new challenges, because you may want to allow the user to edit the tag, but SQLite deals with them well. When creating foreign keys to the "tags" table, you will want to make sure to use "ON UPDATE CASCADE" so that editing the title of a tag will update all of its foreign keys.
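A minimal plain-SQLite sketch of the schema described above may help make it concrete. The "photos" and "photoTags" tables and their columns are hypothetical names chosen for illustration; they are not taken from the library or its demo apps, and this says nothing about how the library itself creates or syncs tables.

```sql
-- Foreign key enforcement (including cascades) is off by default in SQLite.
PRAGMA foreign_keys = ON;

CREATE TABLE "tags" (
  -- The title is the identity of the tag; NOCASE makes 'Family' and 'family' collide.
  "title" TEXT COLLATE NOCASE PRIMARY KEY NOT NULL
);

-- Hypothetical table, for illustration only.
CREATE TABLE "photos" (
  "id" TEXT PRIMARY KEY NOT NULL
);

-- Hypothetical join table, for illustration only.
CREATE TABLE "photoTags" (
  "id" TEXT PRIMARY KEY NOT NULL,
  "photoID" TEXT NOT NULL REFERENCES "photos"("id") ON DELETE CASCADE,
  -- ON UPDATE CASCADE rewrites this column when the tag's title is edited.
  "tagTitle" TEXT NOT NULL REFERENCES "tags"("title")
    ON UPDATE CASCADE ON DELETE CASCADE
);

-- Renaming a tag updates every "photoTags" row that points at it.
UPDATE "tags" SET "title" = 'Relatives' WHERE "title" = 'Family';
```

With this shape, renaming a tag is a single UPDATE on "tags", and SQLite propagates the new title to the referencing rows.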
I believe this all works out of the box right now. The unique ID from the bank should be the primary key of the table, and as I mentioned above, primary keys are treated specially in the library. When records are sync'd from CK we perform an upsert based on the PK and then resolve conflicts on a per-field basis. There will be no duplicated data in this situation.
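As a rough illustration of what "upsert based on the PK" means in SQLite terms, here is a sketch under stated assumptions. The "transactions" table, its columns, and the sample values are hypothetical. This simple statement overwrites every column on conflict; it does not model the per-field last-edit-wins merge described above, and it is not the library's actual implementation.

```sql
-- Hypothetical table: "id" holds the unique ID assigned by the bank.
CREATE TABLE "transactions" (
  "id" TEXT PRIMARY KEY NOT NULL,
  "amount" INTEGER NOT NULL,
  "memo" TEXT NOT NULL DEFAULT ''
);

-- Insert the row if the primary key is new; otherwise update the existing row.
-- Running this on two devices for the same bank ID never produces two rows.
INSERT INTO "transactions" ("id", "amount", "memo")
VALUES ('bank-123', 4200, 'Coffee')
ON CONFLICT ("id") DO UPDATE SET
  "amount" = excluded."amount",
  "memo" = excluded."memo";
```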
This too should already work just fine, as long as your seeded data has stable primary keys. That is, the data seeded from device A generates the same primary keys as device B. If that is upheld, then again there will be no duplicated data, even when multiple devices sync the same seeded data.

So, as far as I can tell, the main situations you are concerned with are already handled by the library. If there are other situations in which non-primary key fields need to have uniqueness constraints, then we may consider adding that constraint failure handler I mentioned at the beginning of this post, but we first need a rock solid use case to test it out in full.

We can also update our reminders demo app to make it so that the tags table has its title as the primary key. That will show how this can work in practice.

But we do actually already have one example of the techniques I described above. Our reminders app allows one to associate cover images with lists. We store the image data in a separate table outside of the "remindersLists" table, and that table uses "remindersLists"."id" as both its primary key and a foreign key. That naturally gives us a uniqueness constraint that ensures there is at most one image associated with a reminders list, and if two devices create images at the same time, the records will be merged on a per-field basis.
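The "primary key that is also a foreign key" pattern can be sketched in plain SQLite as follows. The name "remindersListCoverImages" and the column names are assumptions made for illustration; the demo app's real schema may differ.

```sql
PRAGMA foreign_keys = ON;

CREATE TABLE "remindersLists" (
  "id" TEXT PRIMARY KEY NOT NULL,
  "title" TEXT NOT NULL DEFAULT ''
);

-- Hypothetical table name and columns, for illustration only.
CREATE TABLE "remindersListCoverImages" (
  -- Being the primary key allows at most one image row per list;
  -- being a foreign key ties that row to an existing list.
  "remindersListID" TEXT PRIMARY KEY NOT NULL
    REFERENCES "remindersLists"("id") ON DELETE CASCADE,
  "image" BLOB NOT NULL
);
```

Because the list's ID is the image row's identity, two devices that each add a cover image to the same list produce rows with the same primary key, which is the situation the per-field merge is described as handling.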
-
Hi @mbrandonw, thank you for the thoughtful response. I can see that my thinking on this topic is heavily shaped by thinking in terms of Core Data object instances, where each is created with a unique persistent identifier, rather than database rows. Your explanations make sense to me. I had reviewed your example apps and didn't see a way to generate an upsert / deduplication scenario, which contributed to my thinking it wasn't supported. I think it would be helpful to also make it possible to add Tags to the Reminders app in addition to the primary key change (right now they are read-only, with no UI to add or edit). As the documentation is developed, I think it would also be helpful to explicitly discuss this topic, as it's a non-trivial consideration for people coming from Core Data / Swift Data, and it may be a non-issue with sharing-grdb. That's a nice benefit of the library!
-
Hi @mbrandonw, I realized last night that I have a real-world example that is not transparently solved by the behavior you describe, because it requires uniqueness on both a property and a relationship. I made a Swift Data app for the iOS 17 launch that is a kind of weather journal. Users identify locations on a map, and the app uses a weather API to download daily weather summaries, which are persisted and summarized. As a toy example:

import Foundation
import SwiftData

@Model
public final class Location {
    public var id: UUID = UUID()
    // Deleting a location deletes all of its weather records.
    @Relationship(deleteRule: .cascade, inverse: \WeatherRecord.location)
    public var weatherRecords: [WeatherRecord]? = []
}

@Model
public final class WeatherRecord: Identifiable {
    public var id: UUID = UUID()
    public var date: Date = Date.now
    public var dailyHighTemperature: Double? // data from the weather API
    @Relationship(deleteRule: .nullify)
    public var location: Location?
}

The business logic is that we are building a calendar of historical data, with one WeatherRecord per day, per location. So the uniqueness constraint for the WeatherRecord table is that each record must be unique for the combination of date and Location. The lack of good deduplication support in Swift Data prevented me from ever enabling CloudKit support in this app, because each device would download and persist the same weather API data for each location.

I could imagine that I would be able to use your existing uniqueness functionality to solve this by constructing a primary key for WeatherRecord that is composed of its owning Location's UUID plus its own date. Something like "\(self.location.id.uuidString)+\(self.date.timeIntervalSinceReferenceDate)", which would be unique for the WeatherRecord table.

Does this seem like the right way to solve this problem using sharing-grdb? I'm not a database person, so I don't know if primary key solutions like this are normal or if that smells strange to you.
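For reference, the derived-key idea in this post could look roughly like the following in plain SQLite. This is only a sketch: the table and column names are hypothetical, the sample UUID and temperatures are made up, and it swaps the raw timestamp for a calendar-day string (an assumption, on the grounds that two devices would need to produce byte-identical keys for the same day). Whether this is the recommended approach for sharing-grdb is exactly the open question above.

```sql
PRAGMA foreign_keys = ON;

-- Hypothetical tables, for illustration only.
CREATE TABLE "locations" (
  "id" TEXT PRIMARY KEY NOT NULL
);

CREATE TABLE "weatherRecords" (
  -- Derived key: '<locationID>:<yyyy-mm-dd>', identical on every device.
  "id" TEXT PRIMARY KEY NOT NULL,
  "locationID" TEXT NOT NULL REFERENCES "locations"("id") ON DELETE CASCADE,
  "date" TEXT NOT NULL,
  "dailyHighTemperature" REAL
);

INSERT INTO "locations" ("id")
VALUES ('5B1C0E3A-0000-4000-8000-000000000001');

-- A second device inserting the same location and day hits the same key,
-- so the row is updated instead of duplicated.
INSERT INTO "weatherRecords" ("id", "locationID", "date", "dailyHighTemperature")
VALUES (
  '5B1C0E3A-0000-4000-8000-000000000001' || ':' || '2024-06-01',
  '5B1C0E3A-0000-4000-8000-000000000001',
  '2024-06-01',
  27.5
)
ON CONFLICT ("id") DO UPDATE SET
  "dailyHighTemperature" = excluded."dailyHighTemperature";
```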
-
When syncing between devices on the same Apple ID, business logic may define data that should be unique based on some field(s), but CloudKit does not support uniqueness constraints. As a result, each device may upload its own copy of the same data to CloudKit, and the duplicates must later be merged using custom logic. In this scenario, we are not updating an entity where the last update wins; we are creating a new entity that should be merged into an existing entity while preserving the relationships that existed on the original entities. For example:
I have never worked on an app that had cross device sync where data deduplication wasn't a problem that needed to be solved. It's mostly about syncing globally unique strings, like tags on photos, or we are downloading data from an API that is processed and synced to CloudKit, and multiple devices will fetch the same data if they don't know another device has already fetched it.
Existing solutions
I'm aware of three sample Core Data projects from Apple that solve this problem, starting with WWDC 2012. The overall flow uses persistent history tracking in this process:
Apple has provided two more modern sample code projects demonstrating the solution in detail:
The Apple sample project Synchronizing a local store to the cloud has Tags that are deduplicated based on name. There is an edge case bug in this code where relationships may be lost when deduplicated, but the code is simpler to understand.
The problem of lost references is solved in their newest (iOS 17.4), more complex example project Sharing Core Data objects between iCloud users. This example also removes duplicate Tags based on name, but uses somewhat more complex logic to avoid the relationship-deletion problem in the earlier example. In this project, see Persistence/PersistenceController+Deduplicate.swift. The solution is essentially to flag entities as "to be deduplicated" and remove them after a delay. The file referenced above includes a comment that explains their process.
Swift Data
Swift Data was launched without support for Persistent History Tracking, and this made it impossible to deduplicate efficiently. Persistent History Tracking was added to Swift Data in iOS 18, but it does not seem to be full-featured enough to solve the problem correctly. I admit to not having looked closely at the persistent history tracking feature in Swift Data so perhaps it is now possible.
Conclusion
This may just be a reframing of the problem that CloudKit doesn't support @unique attributes, but it is a problem that must be solved for cross device sync. Apple chose to solve the lack of uniqueness constraints in CloudKit using complex on-device merge logic.
I've looked through the sharing-grdb example projects and documentation, and it doesn't seem possible in the current beta either to enforce uniqueness during upload or to merge based on properties when downloading changes. What are your thoughts on how this uniqueness/deduplication problem might be addressed? Perhaps I am missing some functionality.