Description
Context
- Follows discussion Re-Examine Two-Database Structure data-source-manager#399
- Blocked by deploy from protected `main` branch data-source-manager#474
We have the data sources app and database, which currently handles submission, approval, and display of sources; we sync bi-directionally with the source manager, which is used to identify new sources and manage duplication. It would be much cleaner if sync only ran one way, and we used the source manager for what it says on the tin and the data sources app for doing stuff with the polished dataset.
May replace issues
Goals
- Let's define how we'd like the databases/apps to function, and work from there.
Sources vs URLs
- It may be immensely helpful to delineate what we mean by "sources" vs "URLs".
  - For example, a URL may contain multiple sources.
  - A URL may not be considered a source until it is deemed relevant and/or meets other metadata thresholds; until then it's just a URL.
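One way to make the distinction concrete is sketched below. The class and field names (`TrackedUrl`, `Source`, `relevant`) are hypothetical and chosen only for illustration; they are not taken from either existing schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """A collection of records found at a URL (hypothetical shape)."""
    name: str
    record_type: str


@dataclass
class TrackedUrl:
    """Any URL that has entered our orbit, relevant or not (hypothetical shape)."""
    url: str
    relevant: Optional[bool] = None  # None means "not reviewed yet"
    sources: List[Source] = field(default_factory=list)

    def is_just_a_url(self) -> bool:
        # Until the URL is deemed relevant and yields at least one source,
        # it is only a URL.
        return not (self.relevant and self.sources)


# One URL may contain multiple sources.
portal = TrackedUrl("https://example.gov/records", relevant=True)
portal.sources.append(Source("Arrest logs", "Arrest Records"))
portal.sources.append(Source("Use of force reports", "Use of Force Reports"))

# A URL nobody has reviewed yet is still just a URL.
unreviewed = TrackedUrl("https://example.com/unknown")
```

Keeping sources as children of a URL, rather than treating the two as the same row, is what lets one URL map to several sources.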
Source manager app/database
- Contains all URLs that have entered our orbit, whether relevant sources or not
- Contains extended metadata about URLs
- Handles processing and creation of new sources through the database
- Handles creation and modification of sources and agencies
- Handles linking sources to agencies
- Handles source health checks, redirect and duplicate mapping
- Contains information about each URL's status
- Triggers archives, stores archives for each URL
Data sources app/database
- The main offering of PDAP, and pivot point for ~all our other tools and activities
- The foundation of our API and front end
- A repository of relevant sources with as much context as possible
- Sources don't appear here until a minimum metadata threshold is met; this sync happens automatically, frequently, incrementally
  - `relevancy`, `agency`, `record type`, `name`
- Access-related metadata is useful here (format, coverage range, etc)
- Information about users and their follows/requests, which informs the source manager
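The minimum metadata threshold could be expressed as a simple check, sketched here. It assumes the four fields listed above are the whole threshold and that "met" means "present and non-empty"; the function name and the dict-based record shape are hypothetical.

```python
# Assumption: these four fields are the complete minimum threshold.
REQUIRED_FIELDS = ("relevancy", "agency", "record type", "name")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in `record`."""
    return [key for key in REQUIRED_FIELDS if record.get(key) in (None, "", [])]


def meets_minimum_metadata(record: dict) -> bool:
    """True when the record is complete enough to appear in the data sources app."""
    return not missing_fields(record)


candidate = {"name": "Arrest logs", "agency": "Example PD", "relevancy": "relevant"}
complete = dict(candidate, **{"record type": "Arrest Records"})
```

Returning the list of missing fields, not only a boolean, would let the source manager show why a URL has not been mirrored yet.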
Requirements
- Generally the goal is to make the databases sync in one direction: `source manager → data sources app`
- This means that if the source manager requires anything from the data sources app, it must use the API
- Ideally this migration would happen via API; i.e. via POST from SM or GET from DS
- The goal should be a refactor. The front end should largely be unaffected, except perhaps `base_url` and endpoint definitions for data source submission
  - Notably, most of our tools for managing data sources and agencies in Retool would be affected
- We would probably make a few `views`, materialized or not, which contain properties relevant for management vs presentation
- Let's take this opportunity to go through the schemas with a fine-tooth comb to make this as future-proof as possible
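A minimal sketch of a one-way, incremental sync pass follows. It assumes each source manager row carries an `updated_at` timestamp and that the transport is injected as a `push` callable (in practice this might wrap a POST to the data sources API). The function name, row shape, and cursor approach are all hypothetical, not an existing implementation.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable


def sync_once(rows: Iterable[dict], since: datetime,
              push: Callable[[dict], None]) -> datetime:
    """Push rows changed after `since`, oldest first; return the new cursor."""
    cursor = since
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        if row["updated_at"] <= since:
            continue  # already synced in an earlier pass
        push(row)  # data only ever flows source manager -> data sources app
        cursor = row["updated_at"]
    return cursor


rows = [
    {"id": 1, "name": "Arrest logs",
     "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "name": "Dispatch logs",
     "updated_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]

sent = []  # stand-in for the API call
cursor = sync_once(rows, since=datetime(2025, 1, 2, tzinfo=timezone.utc),
                   push=sent.append)
```

Because the function only reads from the source manager side and only writes through `push`, nothing in it can flow back the other way; anything the source manager needs from the data sources app would go through separate API reads.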
Why not just have one database and API?
In theory, we could combine into one database and use materialized views and connection pools to protect read traffic, but we decided to keep them separate to prioritize data sources app uptime, and protect it from the more unstable, larger-scale, more complex source manager app.
Visualizing the plan
🏠 = the item's home
🪞 = the item is mirrored here*
🌐 = the item is accessed via API
| schema items / features | source manager | data sources app | notes |
|---|---|---|---|
| All URLs in our orbit | 🏠 | | |
| Submission of new sources | 🏠 | | |
| Editing sources + metadata | 🏠 | | |
| Validated URLs of varying health | 🏠 | 🪞 | deciding what is mirrored using a well-documented Great Filter |
| Access-related "OG" metadata about sources | 🏠 | 🪞 | As described in the Data Sources data dictionary. |
| Agencies, locations | 🏠 | 🪞 | and links between them and sources |
| Source status & health | 🏠 | 🪞 | |
| Users | 🌐 for auth | 🏠 | |
| Follows | 🌐 for prioritization | 🏠 | |
| Requests | 🌐 for prioritization | 🏠 | |
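The table above can also be read as a small lookup, sketched here so the plan can be checked mechanically. The item keys, the `Access` enum, and the rule that only an item's home may write to it are assumptions made for illustration; the table itself only says where each item lives.

```python
from enum import Enum


class Access(Enum):
    HOME = "home"      # 🏠
    MIRROR = "mirror"  # 🪞
    API = "api"        # 🌐
    NONE = "none"      # blank cell


# (source manager, data sources app) for each row; keys are hypothetical.
PLAN = {
    "all_urls": (Access.HOME, Access.NONE),
    "source_submission": (Access.HOME, Access.NONE),
    "source_editing": (Access.HOME, Access.NONE),
    "validated_urls": (Access.HOME, Access.MIRROR),
    "access_metadata": (Access.HOME, Access.MIRROR),
    "agencies_locations": (Access.HOME, Access.MIRROR),
    "source_status_health": (Access.HOME, Access.MIRROR),
    "users": (Access.API, Access.HOME),
    "follows": (Access.API, Access.HOME),
    "requests": (Access.API, Access.HOME),
}

APPS = ("source_manager", "data_sources_app")


def home_of(item: str) -> str:
    """Return the app that owns `item`."""
    return APPS[PLAN[item].index(Access.HOME)]


def can_write(app: str, item: str) -> bool:
    # Assumption: only the home app writes; mirrors and API access are read-only.
    return home_of(item) == app
```

A check like this could run in tests to confirm that every item has exactly one home and that nothing owned by the data sources app is mirrored back.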