
Make source manager the "source of truth" for the database #868

@josh-chamberlain

Context

The data sources app and database currently handle submission, approval, and display of sources. They sync bi-directionally with the source manager, which is used to identify new sources and manage duplication. It would be much cleaner if sync ran only one way: the source manager would do what it says on the tin, and the data sources app would work with the polished dataset.

May replace issues

Goals

  • Let's define how we'd like the databases/apps to function, and work from there.

Sources vs URLs

  • It may be immensely helpful to delineate what we mean by "sources" vs "URLs".
    • for example, a URL may contain multiple sources
    • a URL may not be considered a source until it is deemed relevant, and/or meets other metadata thresholds; until then it's just a URL
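The distinction above can be sketched as a small data model. This is only an illustration: the class names `TrackedURL` and `Source`, their fields, and the rule in `is_source` are assumptions made for the example, not the actual schema of either database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """One relevant record set described at a URL (hypothetical fields)."""
    name: str
    record_type: str


@dataclass
class TrackedURL:
    """Any URL that has entered our orbit, relevant or not (hypothetical model)."""
    url: str
    relevant: Optional[bool] = None  # None means "not yet reviewed"
    sources: List[Source] = field(default_factory=list)  # a URL may contain several

    def is_source(self) -> bool:
        # Assumed rule: until a URL is deemed relevant and has at least one
        # described source, it is "just a URL".
        return self.relevant is True and len(self.sources) > 0
```

Keeping `sources` as a list on the URL is one way to represent "a URL may contain multiple sources"; a separate link table would work equally well.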

Source manager app/database

  • Contains all URLs that have entered our orbit, whether relevant sources or not
  • Contains extended metadata about URLs
  • Handles processing and creation of new sources through the database
  • Handles creation and modification of sources and agencies
  • Handles linking sources to agencies
  • Handles source health checks, redirect and duplicate mapping
  • Contains information about each URL's status
  • Triggers archives, stores archives for each URL

Data sources app/database

  • The main offering of PDAP, and pivot point for ~all our other tools and activities
    • The foundation of our API and front end
    • A repository of relevant sources with as much context as possible
  • Sources don't appear here until a minimum metadata threshold is met; this sync happens automatically, frequently, and incrementally
    • threshold fields: relevancy, agency, record type, name
  • Access-related metadata is useful here (format, coverage range, etc)
  • Information about users and their follows/requests, which informs the source manager
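The minimum metadata threshold could be checked with something like the sketch below. The dictionary keys (`relevant`, `agency`, `record_type`, `name`) and the choice to treat relevancy as a boolean that must be `True` are assumptions for illustration; the real threshold would be defined by the schema review.

```python
# Hypothetical key names for the metadata a source needs before it is mirrored.
REQUIRED_TEXT_FIELDS = ("agency", "record_type", "name")


def missing_metadata(record: dict) -> list:
    """Return the threshold fields that are absent or empty on a record."""
    missing = []
    # Assumption: relevancy must be affirmatively True, not merely present.
    if record.get("relevant") is not True:
        missing.append("relevant")
    for key in REQUIRED_TEXT_FIELDS:
        if not record.get(key):
            missing.append(key)
    return missing


def meets_threshold(record: dict) -> bool:
    """True when a record has everything needed to appear in the data sources app."""
    return not missing_metadata(record)
```

Returning the list of missing fields, rather than only a boolean, would let management tooling show why a URL has not been promoted yet.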

Requirements

  • Generally the goal is to make the databases sync in one direction: source manager → data sources app
    • This means that if the source manager requires anything from the data sources app, it must use the API
  • Ideally this migration would happen via API, i.e. a POST from the source manager (SM) or a GET from the data sources app (DS)
  • The goal should be a refactor. The front end should largely be unaffected, except perhaps base_url and endpoint definitions for data source submission
    • notably, most of our tools for managing data sources and agencies in Retool would be affected
  • We would probably make a few views, materialized or not, which contain properties relevant for management vs presentation
  • Let's take this opportunity to go through the schemas with a fine-tooth comb to make this as future-proof as possible
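A one-directional, incremental sync along these lines might look like the following sketch. It is not an existing implementation: `sync_once`, the `updated_at` cursor, the `eligible` flag, and the `push` callback are all hypothetical. In practice `push` would wrap a POST to the data sources app API; here it is an in-memory stand-in so the example runs on its own.

```python
from typing import Callable, Iterable


def sync_once(sm_records: Iterable[dict], push: Callable[[dict], None], cursor: int) -> int:
    """Push eligible records changed after `cursor`; return the new cursor.

    Data only flows source manager -> data sources app. Nothing is read back.
    """
    new_cursor = cursor
    for rec in sorted(sm_records, key=lambda r: r["updated_at"]):
        if rec["updated_at"] <= cursor:
            continue  # already handled in an earlier run
        if rec.get("eligible"):
            push(rec)  # assumed to be an idempotent upsert on the receiving side
        new_cursor = max(new_cursor, rec["updated_at"])
    return new_cursor


# Usage with an in-memory stand-in for the data sources app.
data_sources_app = {}


def push_to_ds(rec: dict) -> None:
    data_sources_app[rec["id"]] = dict(rec)


records = [
    {"id": 1, "updated_at": 5, "eligible": True},
    {"id": 2, "updated_at": 7, "eligible": False},
    {"id": 3, "updated_at": 3, "eligible": True},
]
cursor = sync_once(records, push_to_ds, cursor=4)
```

Because the cursor only advances and the push is an upsert, running the sync again with the returned cursor sends nothing new, which is what makes frequent runs cheap.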

Why not just have one database and API?

In theory, we could combine into one database and use materialized views and connection pools to protect read traffic, but we decided to keep them separate to prioritize data sources app uptime, and protect it from the more unstable, larger-scale, more complex source manager app.

Visualizing the plan

🏠 = the item's home
🪞 = the item is mirrored here*
🌐 = the item is accessed via API

| schema items / features | source manager | data sources app | notes |
| --- | --- | --- | --- |
| All URLs in our orbit | 🏠 | | |
| Submission of new sources | 🏠 | | |
| Editing sources + metadata | 🏠 | | |
| Validated URLs of varying health | 🏠 | 🪞 | deciding what is mirrored using a well-documented Great Filter |
| Access-related "OG" metadata about sources | 🏠 | 🪞 | As described in the Data Sources data dictionary. |
| Agencies, locations | 🏠 | 🪞 | and links between them and sources |
| Source status & health | 🏠 | 🪞 | |
| Users | 🌐 for auth | 🏠 | |
| Follows | 🌐 for prioritization | 🏠 | |
| Requests | 🌐 for prioritization | 🏠 | |
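The plan could also be written down as data, so that tooling can check which app is allowed to write a given item. The sketch below only restates the rows of the plan; the identifiers (`PLAN`, `Role`, `can_write`, the item keys) are made up for illustration, and the rule that only an item's home may write it is an assumption drawn from the one-way sync goal.

```python
from enum import Enum


class Role(Enum):
    HOME = "home"      # 🏠 the item's home
    MIRROR = "mirror"  # 🪞 the item is mirrored here
    API = "api"        # 🌐 the item is accessed via API


SM, DS = "source_manager", "data_sources_app"

# Hypothetical item keys, one per row of the plan.
PLAN = {
    "all_urls": {SM: Role.HOME},
    "source_submission": {SM: Role.HOME},
    "source_editing": {SM: Role.HOME},
    "validated_urls": {SM: Role.HOME, DS: Role.MIRROR},
    "access_metadata": {SM: Role.HOME, DS: Role.MIRROR},
    "agencies_locations": {SM: Role.HOME, DS: Role.MIRROR},
    "source_status_health": {SM: Role.HOME, DS: Role.MIRROR},
    "users": {SM: Role.API, DS: Role.HOME},
    "follows": {SM: Role.API, DS: Role.HOME},
    "requests": {SM: Role.API, DS: Role.HOME},
}


def can_write(item: str, app: str) -> bool:
    # Assumed rule: only the app that is an item's home may modify it.
    return PLAN[item].get(app) is Role.HOME
```

For example, `can_write("agencies_locations", DS)` is false under this rule, since the data sources app only mirrors agencies.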
