Incremental synchronization of elements to files #66

Open
joshsh opened this issue Jul 11, 2023 · 1 comment

joshsh commented Jul 11, 2023

There are currently several ways to serialize SmSn graphs to the file system, but the most important for everyday use is the so-called VCS serialization. Every atom corresponds to a file, located in a directory that corresponds to the atom's logical data source. Because every representation in SmSn is a set of atoms, this results in a huge number of files. A bigger problem, however, is that the entire graph must be synced with the file system at once. This takes significant time (several minutes) and requires the user to stop what they are doing and consciously attend to the synchronization process. It is a major barrier to adoption, especially compared with solutions like Org-mode, which sync to the file system directly. Using Neo4j as the source of truth for SmSn data between sync operations is also something of a liability. I have personally experienced major data loss and corruption when Neo4j silently failed for one reason or another and too much time had elapsed between syncs.

Fixing this problem should actually result in a much simpler solution than the current one. Going forward, there will be a configurable source of truth (SoT), which could be Neo4j or another TinkerPop-enabled graph DB, but could also be another data store such as the file system. The file system will be the default; graph database support might be added again later (SmSn is not up to date with recent versions of Neo4j). No bulk sync operations will be necessary when the file system is the SoT, and the user will be free to place the data directory under version control using a solution of their choice. Much as we do now, we will provide a starter kit that uses Git as the version control solution.
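To make the idea of a configurable source of truth concrete, here is a minimal sketch in Python. It is not SmSn code: the names `ElementStore`, `FileSystemStore`, `open_store`, the `source_of_truth` and `data_dir` config keys, and the JSON file format are all assumptions made for illustration. It only shows the shape of the design, in which each write goes straight to one file per element in a directory per data source, so no bulk sync step is needed.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class ElementStore(Protocol):
    """Hypothetical source-of-truth interface (not an existing SmSn API)."""

    def read(self, source: str, element_id: str) -> Optional[dict]: ...

    def write(self, source: str, element_id: str, element: dict) -> None: ...


class FileSystemStore:
    """One file per element, in one directory per data source (assumed layout)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, source: str, element_id: str) -> Path:
        # The ".json" extension and JSON encoding are placeholders for the real format.
        return self.root / source / f"{element_id}.json"

    def read(self, source: str, element_id: str) -> Optional[dict]:
        path = self._path(source, element_id)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def write(self, source: str, element_id: str, element: dict) -> None:
        path = self._path(source, element_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file and rename it, so that an interrupted
        # write does not leave a half-written element file behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(element, sort_keys=True), encoding="utf-8")
        tmp.replace(path)


def open_store(config: dict) -> ElementStore:
    """Pick the source of truth from configuration; the file system is the default."""
    kind = config.get("source_of_truth", "filesystem")
    if kind == "filesystem":
        return FileSystemStore(Path(config["data_dir"]))
    # A graph database backend could be registered here later.
    raise ValueError(f"unsupported source of truth: {kind}")
```

Because every write touches a single small file, the data directory can be committed to Git (or any other version control system) at whatever granularity the user prefers.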

cc @jmatsushita

@joshsh joshsh added the v2 label Jul 11, 2023

joshsh commented Jul 11, 2023

Note: as SmSn graphs are typically small enough to fit in memory, an in-memory cache of the user's graph will be available on the server. When an element of the graph is updated, the change will be placed in a queue for writing to the file system, and subsequent reads will not block on that write. The usual considerations apply with respect to concurrency and consistency. A bulk refresh-from-disk operation will still need to be supported, but it can be more efficient than the current bulk read if we store a timestamp with each element and read an element's file only if its timestamp has changed.
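A rough sketch of the cache, write queue, and timestamp-based refresh described above, again in Python and again purely illustrative. The class `CachedGraph` and its methods are hypothetical, elements are stored as JSON in a single flat directory for brevity, and the file's modification time stands in for the per-element timestamp (a timestamp stored inside the element would work similarly).

```python
import json
import queue
import threading
from pathlib import Path


class CachedGraph:
    """In-memory element cache with write-behind persistence (illustrative only)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._elements = {}  # element id -> element data
        self._mtimes = {}    # element id -> file mtime (ns) last seen by the cache
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def _path(self, element_id):
        return self.root / f"{element_id}.json"

    def get(self, element_id):
        # Reads are served from memory and never wait for the disk.
        with self._lock:
            return self._elements.get(element_id)

    def put(self, element_id, element):
        # Update memory first, then queue the file write for the background thread.
        with self._lock:
            self._elements[element_id] = element
        self._pending.put((element_id, element))

    def _writer(self):
        # Background thread: drains the queue and writes one file per element.
        while True:
            element_id, element = self._pending.get()
            try:
                path = self._path(element_id)
                tmp = path.with_suffix(".tmp")
                tmp.write_text(json.dumps(element), encoding="utf-8")
                tmp.replace(path)
                with self._lock:
                    self._mtimes[element_id] = path.stat().st_mtime_ns
            finally:
                self._pending.task_done()

    def flush(self):
        # Block until every queued write has reached the file system.
        self._pending.join()

    def refresh_from_disk(self):
        """Re-read only files whose timestamp changed; return the ids re-read."""
        self.flush()
        changed = []
        for path in self.root.glob("*.json"):
            element_id = path.stem
            mtime = path.stat().st_mtime_ns
            with self._lock:
                if self._mtimes.get(element_id) == mtime:
                    continue  # unchanged since the cache last saw it
            element = json.loads(path.read_text(encoding="utf-8"))
            with self._lock:
                self._elements[element_id] = element
                self._mtimes[element_id] = mtime
            changed.append(element_id)
        return changed
```

This sketch ignores deleted files, write failures, and conflicts between a queued write and an external edit of the same file; those fall under the concurrency and consistency considerations mentioned above.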
