@bwalsh bwalsh commented Jun 11, 2025

Status

  • test progress: all passed (skipped single table tests)

Problem:
As an indexd or DRS user, when I list objects, I only expect to see items that belong to projects I have access to.

Solution:
Assuming a Bearer token is included in the request, I expect indexd to query Arborist, extract the projects I have access to, and add those as an "authz" filter when querying the database. A feature flag should control this query injection. The flag should default to FALSE, as this will improve the chances of getting the PR approved. All current unit tests should pass, and additional unit tests should confirm the new behavior.

Alternatives:
We could put an RBAC-aware proxy in front of indexd; however, this would add complexity and processing overhead.

Context:
The main auth code has two methods, auth and authz. The indexd.authorize method checks whether a Basic auth header is present: if so, auth is called; otherwise, authz is called. The revproxy gateway injects this header. This reliance on Basic auth is concerning, and its rationale is undocumented. Based on the client API, it does not appear to be used for either create or read.

Approach:
Add code to get_index to call auth_mapping and inject the resulting resources (projects) into the query.

  • skip if feature flag not enabled
  • 401 if Bearer token not available
  • update dependency gen3authz as latest version includes token as parameter (as an alternative to username)
  • use mock_arborist_requests pytest fixture
  • update tests
  • add new tests specific to RBAC
  • Add feature flag to default_settings
  • Add corresponding feature flag to helm chart
  • Remove extraneous logging and debugging code
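
A minimal sketch of the request flow listed above, assuming the flag, the token, and the Arborist lookup are passed in explicitly. The names `resolve_injected_authz`, `fetch_auth_mapping`, and `Unauthorized` are hypothetical and are not taken from the PR; the real change lives in get_index and uses gen3authz.

```python
from typing import Callable, Dict, List, Optional


class Unauthorized(Exception):
    """Hypothetical stand-in for the 401 response described above."""


def resolve_injected_authz(
    rbac_enabled: bool,
    bearer_token: Optional[str],
    fetch_auth_mapping: Callable[[str], Dict[str, list]],
) -> Optional[List[str]]:
    """Return resource paths to inject into the query, or None for no injection.

    fetch_auth_mapping is an assumed callable that maps a token to a dict
    keyed by resource path (a stand-in for the Arborist lookup).
    """
    if not rbac_enabled:
        # Feature flag off: leave the query untouched.
        return None
    if not bearer_token:
        # Matches the "401 if Bearer token not available" item.
        raise Unauthorized("Authorization header is required for RBAC")
    # The keys of the mapping are the resources (projects) the user can access.
    return sorted(fetch_auth_mapping(bearer_token))


def fake_auth_mapping(token: str) -> Dict[str, list]:
    """Fake lookup used only for this illustration."""
    return {
        "/programs/bpa/projects/UChicago": [],
        "/programs/other/projects/project": [],
    }
```

With the flag off the function returns None, so existing behavior is unchanged; with the flag on, the returned paths are what would be added to the query as the authz filter.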

Author

bwalsh commented Jun 11, 2025

reviewers guide

Setup Instructions

Testing

  • See pytest
    • Suggest doing this on the master branch first to ensure all tests pass

Code Review

  • main changes were made to indexd.index.blueprint.py::get_index
    • The goal of the changes is to check the authz parameter: if it is empty, substitute all the Arborist resources the user has access to.

    • In order to do this we needed to fetch the current user's authorized resources from Arborist.

      • To do that, we updated the gen3authz dependency to the latest version, which includes methods to look up resources by token as opposed to by username.
    • Now that we have the user's resources, we can check if the requested resource is in the list of resources they have access to.

      • We did not redesign the query logic:
        • The existing API and logic is: if the authz parameter is set, the records returned must be in ALL "projects" specified in the authz parameter
        • We want to keep this behavior, so it shouldn't be breaking
        • However, if the authz parameter is empty, we will return all records that the user has access to, regardless of the "project" they are in.
        • This means that if a user has access to multiple projects, they will see records from all those projects when authz is empty.
        • See alchemy driver
        • We check the blueprint record for an rbac attribute, and if it is true, we retrieve the resources, enforce access, and add an ANY filter to the query to limit unconstrained queries
    • All of the changes above:

      • should be transparent to the user, and they should not notice any difference in behavior.
      • should be non-breaking, as it only changes the behavior when the authz parameter is empty.
      • However, it will throw a 401/403 if the user does not have access to the requested resource or does not send an Authorization header. This is a change from the previous behavior, where all records were returned regardless of the user's access.
  • "Breaking" Changes:
    • In order to enforce authorization, we need to ensure that all records have an authz field.
    • (This is not a change in behavior to OHSU/ACED/Calypr, but it is a change in behavior to the Indexd API in that effectively authz is mandatory on write)
  • Misc:
    • Added stack traces to the log for unhandled exceptions
  • Tests:
    • Introducing this change required us to update the tests to account for the new behavior.
      • There are a lot of existing tests :-)
      • There are a lot of deprecated features that are still in the codebase :-(
      • All of the reads (see client.get(...)) now pass the user's Authorization header:
        res = client.get(alias_endpoint) # old
        res = client.get(alias_endpoint, headers=user) # new
        
      • There are legacy fields in the indexd records that are no longer used; see authz vs acl
      • Since the authz field is now required AND is tied to the user's Arborist resources, we need to ensure that all records have an authz field with proper values.
        • In most cases, the authorized resources are: ["/programs/bpa/projects/UChicago", "/programs/other/projects/project"]
        • In most cases, the un-authorized resources are: ["/programs/forbidden/projects/project"]
        • All of these need to be updated in the tests to ensure that the authz field is set correctly.
      • Test fixtures:
        • There is no "Arborist" server, that is mocked out in the tests.
        • See conftest
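
The filtering semantics described in the code review notes above (ALL when authz is supplied, ANY across the user's resources when it is empty) can be sketched on in-memory records. This is only an illustration under stated assumptions: `filter_records` is a hypothetical name, the real logic lives in the alchemy driver as SQL filters, and Arborist's resource hierarchy is ignored here by comparing paths exactly.

```python
from typing import Dict, List

RECORDS: List[Dict[str, object]] = [
    {"did": "a", "authz": ["/programs/bpa/projects/UChicago"]},
    {"did": "b", "authz": ["/programs/other/projects/project"]},
    {"did": "c", "authz": ["/programs/forbidden/projects/project"]},
    {
        "did": "d",
        "authz": [
            "/programs/bpa/projects/UChicago",
            "/programs/other/projects/project",
        ],
    },
]

USER_RESOURCES = [
    "/programs/bpa/projects/UChicago",
    "/programs/other/projects/project",
]


def filter_records(records, requested_authz, user_resources):
    """Apply ALL semantics for an explicit authz filter, ANY otherwise."""
    allowed = set(user_resources)
    if requested_authz:
        wanted = set(requested_authz)
        if not wanted <= allowed:
            # Stand-in for the 403 raised when a requested resource is not permitted.
            raise PermissionError("user lacks access to a requested resource")
        # Existing behavior: a record must carry ALL requested resources.
        return [r for r in records if wanted <= set(r["authz"])]
    # New behavior: a record matches if it shares ANY resource with the user.
    return [r for r in records if allowed & set(r["authz"])]
```

With these sample records, an empty authz parameter yields records a, b, and d, while requesting both authorized resources yields only d.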


lbeckman314 commented Jun 12, 2025

Testing Steps 🌀

1. Start Postgres ✔️

➜ brew services start postgresql
==> Successfully started `postgresql@14` (label: homebrew.mxcl.postgresql@14)

➜ brew services list
Name              Status  User     File
postgresql@14     started beckmanl ~/Library/LaunchAgents/homebrew.mxcl.postgresql@14.plist

➜ psql postgres
psql (14.13 (Homebrew))
Type "help" for help.
postgres=# \c
You are now connected to database "postgres" as user "beckmanl".
postgres=# exit

2. Run Indexd Tests ✔️

➜ gh pr checkout 400
Switched to branch 'feature/rbac'

➜ poetry install
Installing the current project: indexd (5.1.2)

➜ poetry run pytest -vv --cov=indexd --cov-report xml tests
537 passed, 303 skipped, 5320 warnings in 1513.87s (0:25:13)


lbeckman314 commented Jun 16, 2025

Deployment Steps 🚀

1. Deploy Gen3 ✔️

Tip

Required deployment updates:

values.yaml
# Indexd configuration
indexd:
  image:
    repository: quay.io/ohsu-comp-bio/indexd
    tag: feature_rbac  # <---- point to this fork of Indexd (#400)

user.yaml
authz:
  resources:
  - name: services
    subresources:
    - name: indexd
      subresources:
      - name: admin   # <---- Defining the /services/indexd/admin resource

  policies:
  - id: indexd_admin
    description: full access to indexd API
    role_ids:
      - administrator
    resource_paths:
      - /programs
      - /data_file
      - /services/indexd/admin   # <---- Adding the resource to the indexd_admin policy

➜ git clone https://github.com/ACED-IDP/gen3-helm.git -b ohsu-develop

➜ cd gen3-helm

➜ helm dependency build ./helm/gen3

➜ helm upgrade --install local ./helm/gen3  -f values.yaml -f user.yaml

➜ kubectl get deployments/indexd-deployment
NAME                READY   UP-TO-DATE   AVAILABLE
indexd-deployment   1/1     1            1

2. Add Data File ✔️

➜ g3t init $(basename $PWD)

➜ echo 'Example Data' > example.txt

➜ g3t add example.txt --patient example

➜ g3t meta init

➜ g3t commit -m "test: add example file"

➜ g3t push

3. Current Behavior (RBAC Filtering disabled by default) ✔️

➜ curl -s https://calypr.ohsu.edu/index/index
{
  "records": [
    {
      "authz": [
        "/programs/cbds/projects/example"
      ],
      "did": "540b7f64-8e85-5ba1-9d8d-50ebe54e0632",
      "file_name": "example.txt",
    },
  ],
}

4. Enable RBAC Filtering

Tip

helm/indexd/indexd-settings/local_settings.py

CONFIG["RBAC"] = True

➜ helm dependency update ./helm/gen3

➜ helm upgrade --install local ./helm/gen3 -f values.yaml -f user.yaml

➜ kubectl rollout restart deployment/indexd-deployment

5. New Behavior (RBAC Filtering enabled) ✔️

1. No Bearer Token ❌

/index

➜ curl -s https://calypr.ohsu.edu/index/index
{
  "error": "Authorization header is required for RBAC"
}

/ga4gh

➜ curl -s https://calypr.ohsu.edu/ga4gh/drs/v1/objects
{
  "msg": "Authorization header is required for RBAC",
  "status_code": 403
}

2. Invalid Bearer Token ⚠️

/index

➜ curl -s -H "Authorization: Bearer BAD_EXAMPLE" https://calypr.ohsu.edu/index/index
{
  "error": "Failed to get resources from Arborist. Please check your Arborist configuration."
}

/ga4gh

➜ curl -s -H "Authorization: Bearer BAD_EXAMPLE" https://calypr.ohsu.edu/ga4gh/drs/v1/objects
{
  "msg": "Failed to get resources from Arborist. Please check your Arborist configuration.",
  "status_code": 403
}

3. Valid Bearer Token ✅

Tip

Access Token retrieved after logging in to Frontend-Framework:

  • Chrome: Developer Tools > Application > Storage > Cookies > https://calypr.ohsu.edu > access_token

  • Firefox: Developer Tools > Storage > Cookies > https://calypr.ohsu.edu > access_token

export TOKEN=<access_token>

/index

➜ curl -s -H "Authorization: Bearer $TOKEN" https://calypr.ohsu.edu/index/index
{
  "records": [
    {
      "authz": [
        "/programs/cbds/projects/example"
      ],
      "did": "540b7f64-8e85-5ba1-9d8d-50ebe54e0632",
      "file_name": "example.txt",
    },
  ],
}

/ga4gh

➜ curl -s -H "Authorization: Bearer $TOKEN" https://calypr.ohsu.edu/ga4gh/drs/v1/objects
{
  "drs_objects": [
    {
      "access_methods": [
        {
          "access_id": "s3",
          "access_url": {
            "url": "s3://cbds/540b7f64-8e85-5ba1-9d8d-50ebe54e0632/example.txt"
          },
        }
      ],
      "name": "example.txt",
      "self_uri": "drs://PREFIX:540b7f64-8e85-5ba1-9d8d-50ebe54e0632",
    },
}

Environment ⚙️

Open Questions + Next Steps 🌀

  • How can we best test another user's access (e.g. mock/service user)?
  • Is the feature flag for RBAC filtering named ENABLE_RBAC_FILTERING or RBAC (see tests/default_test_settings.py#41)?
  • Update the Indexd config to read from an environment variable so this can be enabled/disabled via Helm values files
  • Check access to the /ga4gh/ and /index endpoints: do both comply with RBAC filtering, or does one provide access while the other doesn't?
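
One possible shape for the environment-variable item above, as a rough sketch. The variable name `INDEXD_RBAC` and the helper `env_flag` are assumptions made for illustration; neither is defined in the PR or in the Helm chart.

```python
import os
from typing import Mapping, Optional

# Strings treated as "on"; anything else (including unset) is "off".
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(
    name: str,
    default: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag from the environment, falling back to a default."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


CONFIG = {}
# INDEXD_RBAC is a hypothetical variable name; default stays False (RBAC off).
CONFIG["RBAC"] = env_flag("INDEXD_RBAC", default=False)
```

A Helm values file could then set the variable on the indexd container instead of editing local_settings.py.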

Additional Resources 📚

Happy 30th Birthday APOD! 🥳

Author

bwalsh commented Jun 18, 2025

Feature Request ⚙️

@lbeckman314 Can you remove the comment above re. the use case document? The request is here:
See updated https://docs.google.com/document/d/1tHFyI-s8N8DccJYnbfxo-hBgOrULms7DMOAmXtqhNeA/edit?usp=drive_web&ouid=110793006573203727769

Author

bwalsh commented Jun 18, 2025

Update 💥

  • removed extraneous DEBUG logging
  • moved rbac enforcement to alchemy driver

improve arborist check
adds additional checks

improve test_multiple_endpoints
Contributor

@Avantol13 Avantol13 left a comment

The detail in the PR description is great, but it must follow our PR template to be parsed correctly. Please move any relevant documentation to a markdown file in the docs folder if you think it's widely useful going forward, otherwise, ensure the PR description follows our template.

You cannot include other markdown headings due to the automated parsing for our release notes, but you can include text above the templated headings with any additional information about the PR. See the PR template.

This initial review is a cursory, high-level single read-through of the code itself and I have not done any setup or testing (which we will need to do eventually).


@blueprint.errorhandler(UserError)
def handle_user_error(err):
    print(f"Uncaught Exception: {err}", file=sys.stderr)
Contributor

Please use cdislogging, not direct prints

Author

done
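
For reference, a sketch of what logging through cdislogging instead of print might look like. It assumes `cdislogging.get_logger(name, log_level=...)` is available and falls back to the standard library so the snippet still runs without it. The logger name and the `(body, 400)` return value are illustrative assumptions, not the PR's actual handler.

```python
import logging

try:
    # Assumed import; cdislogging is the logging helper used by Gen3 services.
    from cdislogging import get_logger

    logger = get_logger("indexd.example", log_level="info")
except ImportError:
    # Fallback so this sketch runs where cdislogging is not installed.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("indexd.example")


def handle_user_error(err: Exception):
    """Log the error through the logger rather than printing to stderr."""
    logger.error("Uncaught exception: %s", err)
    # The response shape and status code here are assumptions for illustration.
    return {"error": str(err)}, 400
```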

You can install Poetry. Make sure the virtual environment is activated.

```console
# Note: this method is deprecated, returns a 404.
Contributor

can you update the install markdown with their recommended method and remove this deprecated one

Author

defer for now

try:
    token = get_jwt_token()
    if not token:
        raise AuthzError("No JWT token found for authorization check")
Contributor

Arborist allows no token to be sent on purpose, it allows assignment of anonymous access. So we don't want to raise this error here

Author

done

if not authorized:
    token = get_jwt_token()
    if not token:
        raise AuthError("No JWT token found for authorization check")
Contributor

see above comment

Author

done

)
token = get_jwt_token()
try:
_ = self.arborist.auth_mapping(
Contributor

don't use _ as a variable name unless it's a return that is unused. Here we're returning it, so we need a name

Author

done

assert data_all_by_md.status_code == 403, f"Expected status code 403, got {data_all_by_md.status_code}"


def test_multiple_endpoints(client, user, mock_arborist_requests, is_rbac_configured):
Contributor

see previous test comment, we need to break this up into smaller, more focused tests

@@ -0,0 +1 @@
3.12
Contributor

everything we run must be Python 3.9

can you remove this and reinstall and relock on 3.9?



@blueprint.errorhandler(AuthError)
def handle_requester_auth_error(err):
Contributor

any new public method needs a Google-style docstring
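
As an illustration of the requested style, here is a Google-style docstring on a simplified version of the handler. The body is a plain-Python stand-in: the `(dict, 403)` return value is an assumption, and the Flask decorator is omitted so the snippet is self-contained.

```python
from typing import Dict, Tuple


def handle_requester_auth_error(err: Exception) -> Tuple[Dict[str, str], int]:
    """Convert an authentication failure into an error response.

    Args:
        err: The exception raised while authenticating or authorizing the
            requester.

    Returns:
        A ``(body, status_code)`` tuple, where ``body`` holds the error
        message and ``status_code`` is the HTTP status to send.
    """
    # Response shape and status code are assumptions for this sketch.
    return {"error": str(err)}, 403
```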

from indexd.index.errors import NoRecordFound as IndexNoRecordFound
from indexd.errors import IndexdUnexpectedError
from indexd.utils import reverse_url
import traceback
Contributor

please ensure isort-style imports.

3 import sections:

  • python built-ins
  • third-party
  • within this code

and each one is alphabetically ordered

traceback is a built-in so it should be in the first block of imports

},
}

CONFIG["RBAC"] = False # RBAC is not enabled by default
Contributor

We should consider a different name for this and a little more description here about how this configuration affects the runtime of the service (some instructions to an operator to understand what True/False really does)

I would maybe recommend ARE_RECORDS_DISCOVERABLE and default to True. Note in a comment that the records themselves contain only file metadata which includes required authorization for underlying files.

I also suspect there is a use case for authorizing the discovery of the records separately from the authorization required for the underlying files. Important to remember that the authz in indexd records currently is intended to represent the authorization required for the underlying files - not the record itself. And I can foresee a potential use of this feature being: no data is discoverable until you "register", then all data is discoverable but you have to apply to specific studies to get access to underlying data.

This solution as it stands is not flexible enough to support the above b/c it couples the authz for the underlying files with the authz to view the indexd record itself. I'm not convinced this is a super future-proof approach.

What we could consider are 2 configs, 1 to turn discovery off and one to toggle whether or not there's a separate authz for all records

ARE_RECORDS_DISCOVERABLE: False

# None below means that each record will be authorized based
# on the authz specified for the underlying files.
# If you set a global discovery authz, this OVERRIDES
# individual record authz for the purpose of discovery
# (e.g. reading the records). Importantly, it DOES NOT
# change any behavior with regards to the authz on the
# record controlling access to underlying data.
GLOBAL_DISCOVERY_AUTHZ: ["/indexd/discovery"]  # or None

I'd like to make sure we support something like this to keep things future proof. So if GLOBAL_DISCOVERY_AUTHZ is set, you ignore the authz on the record and use it instead (for ONLY GET/read records).
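
One way to read the two-config proposal above, sketched as a pure function. This is an interpretation of the suggestion, not code from the PR: `discovery_authz` is a hypothetical helper, and it covers only the read/discovery path, leaving the authz that governs the underlying files untouched.

```python
from typing import List, Optional


def discovery_authz(
    record_authz: List[str],
    are_records_discoverable: bool,
    global_discovery_authz: Optional[List[str]],
) -> Optional[List[str]]:
    """Return the resources to check before showing a record, or None if open.

    Hypothetical helper: applies to GET/read of records only.
    """
    if are_records_discoverable:
        # Discovery is open: no authorization check for reading records.
        return None
    if global_discovery_authz is not None:
        # A global discovery authz overrides the per-record authz for reads.
        return list(global_discovery_authz)
    # Otherwise fall back to the authz required for the underlying files.
    return list(record_authz)
```

For example, with ARE_RECORDS_DISCOVERABLE set to False and GLOBAL_DISCOVERY_AUTHZ set to ["/indexd/discovery"], every record read would be checked against "/indexd/discovery" regardless of the record's own authz.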

@bwalsh bwalsh mentioned this pull request Aug 4, 2025
3 participants