Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .python-version
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
3.12
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything we run must be Python 3.9

can you remove this and reinstall and relock on 3.9?

19 changes: 18 additions & 1 deletion docs/local_dev_environment.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,11 @@ python3 -m venv py38-venv
You can install Poetry. Make sure the virtual environment is activated.

```console
# Note: this method is deprecated, returns a 404.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update the install markdown with their recommended method and remove this deprecated one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer for now

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
source $HOME/.poetry/env
```
See [poetry docs](https://python-poetry.org/docs/#installing-with-pipx)

You can install python dependencies using Poetry:

Expand Down Expand Up @@ -72,7 +74,22 @@ sudo pg_ctlcluster 12 main start
If you're on mac, you can install postgres using brew:

```console
brew install postgres
brew install postgresql
```
Start the PostgreSQL service: Use Homebrew to start the PostgreSQL service:

```console
# start
brew services start postgresql

# verify
brew services list

# connect
psql postgres

# stop
brew services stop postgresql
```

You may also need to create a user.
Expand Down
14 changes: 14 additions & 0 deletions indexd/alias/blueprint.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
import re
import sys
import traceback

import flask
import jsonschema

Expand Down Expand Up @@ -159,11 +162,15 @@ def handle_multiple_records_error(err):

@blueprint.errorhandler(UserError)
def handle_user_error(err):
print(f"Uncaught Exception: {err}", file=sys.stderr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use cdislogging, not direct prints

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

traceback.print_exc(file=sys.stderr)
return flask.jsonify(error=str(err)), 400


@blueprint.errorhandler(AuthError)
def handle_auth_error(err):
print(f"AuthError: {err}", file=sys.stderr)
traceback.print_exc(file=sys.stderr)
return flask.jsonify(error=str(err)), 403


Expand All @@ -172,6 +179,13 @@ def handle_revision_mismatch(err):
return flask.jsonify(error=str(err)), 409


@blueprint.errorhandler(Exception)
def handle_uncaught_exception(err):
print(f"Uncaught Exception: {err}", file=sys.stderr)
traceback.print_exc(file=sys.stderr)
return flask.jsonify(error=f"Internal server error {type(err)} {err}"), 500


@blueprint.record
def get_config(setup_state):
config = setup_state.app.config["ALIAS"]
Expand Down
23 changes: 16 additions & 7 deletions indexd/auth/__init__.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
import sys
from functools import wraps

from flask import current_app
from flask import request

from .errors import AuthError
from .errors import AuthError, AuthzError
from ..errors import UserError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was like this before, but can we switch to absolute imports instead of relative? Those are preferred

Copy link
Author

@bwalsh bwalsh Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done



def authorize(*p):
Expand All @@ -20,8 +22,8 @@ def authorize(*p):

@wraps(f)
def check_auth(*args, **kwargs):
if not request.authorization:
raise AuthError("Username / password required.")
if not request.authorization.parameters.get("username"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this necessary to update?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

raise AuthError(f"Basic auth Username / password required. {request.authorization}")
current_app.auth.auth(
request.authorization.parameters.get("username"),
request.authorization.parameters.get("password"),
Expand All @@ -31,13 +33,20 @@ def check_auth(*args, **kwargs):

return check_auth
else:
method, resources = p
method, resources_ = p
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the suffix _? is that a reserved keyword?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shadows name 'resources' from outer scope (the new method)

if request.authorization and request.authorization.type == "basic":
current_app.auth.auth(
request.authorization.parameters.get("username"),
request.authorization.parameters.get("password"),
)
else:
if not isinstance(resources, list):
raise UserError(f"'authz' must be a list, received '{resources}'.")
current_app.auth.authz(method, list(set(resources)))
if not isinstance(resources_, list):
raise UserError(f"'authz' must be a list, received '{resources_}'.")
return current_app.auth.authz(method, list(set(resources_)))


def resources():
"""
Returns a list of resources the user has access to. Uses Arborist if available.
"""
return current_app.auth.resources()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we likely need to consider caching - and this will be tricky b/c it's authz information.

I expect this change is going to clog the network if indexd is making the same requests to arborist for 10k files when a user is requesting signed URLs.

Can you do some performance evaluation of previous state and future state (where every record is controlled)?

We have some load testing scripts, but I'm not certain they're all up to date to work with the Helm setup. But if you want inspiration: https://github.com/uc-cdis/gen3-code-vigil/blob/master/gen3-load-tests/load_testing_scripts/ga4gh-drs-performance.js

Ultimately, I don't want to push this as-is if we can't scale this to work for the same scenario as above (10k files over 3 studies). We may need to consider how to improve efficiency

8 changes: 8 additions & 0 deletions indexd/auth/driver.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,3 +43,11 @@ def delete(self, username):
Raises AuthError if user doesn't exist.
"""
raise NotImplementedError("TODO")

@abc.abstractmethod
def resources(self):
"""
Returns a list of resources the user has access to. Uses Arborist if available.
Raises AuthError if the user doesn't exist.
"""
raise NotImplementedError("TODO")
40 changes: 37 additions & 3 deletions indexd/auth/drivers/alchemy.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import hashlib
import sys

from contextlib import contextmanager

Expand Down Expand Up @@ -133,18 +134,25 @@ def authz(self, method, resource):
try:
# A successful call from arborist returns a bool, else returns ArboristError
try:
token = get_jwt_token()
if not token:
raise AuthzError("No JWT token found for authorization check")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arborist allows no token to be sent on purpose, it allows assignment of anonymous access. So we don't want to raise this error here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

authorized = self.arborist.auth_request(
get_jwt_token(), "indexd", method, resource
token, "indexd", method, resource
)
except Exception as e:
logger.error(
f"Request to Arborist failed; now checking admin access. Details:\n{e}"
)
authorized = False

if not authorized:
token = get_jwt_token()
if not token:
raise AuthError("No JWT token found for authorization check")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# admins can perform all operations
is_admin = self.arborist.auth_request(
get_jwt_token(), "indexd", method, ["/services/indexd/admin"]
token, "indexd", method, ["/services/indexd/admin"]
)
if not is_admin and not resource:
# if `authz` is empty (no `resource`), admin == access to
Expand All @@ -157,7 +165,33 @@ def authz(self, method, resource):
"The indexd admin '/programs' logic is deprecated. Please update your policy to '/services/indexd/admin'"
)
if not is_admin:
raise AuthError("Permission denied.")
raise AuthError("Permission denied. (not is_admin)")
except AuthError as err:
logger.error(err)
raise err
except AuthzError as err:
logger.error(err)
raise err
except Exception as err:
logger.error(err)
raise AuthzError(err)

def resources(self):
"""
Returns a list of resources for the given user.
"""
if not self.arborist:
raise AuthError(
"Arborist is not configured; cannot perform authorization check"
)
token = get_jwt_token()
try:
_ = self.arborist.auth_mapping(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use _ as a variable name unless it's a return that is unused. Here we're returning it, so we need a name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

jwt=token
)
return _
except Exception as err:
raise AuthError(
"Failed to get resources from Arborist. Please check your Arborist configuration."
)

21 changes: 20 additions & 1 deletion indexd/blueprint.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
import sys

import flask

from indexclient.client import IndexClient
from doiclient.client import DOIClient
from dosclient.client import DOSClient
from hsclient.client import HSClient

from indexd.auth import AuthzError
from indexd.utils import hint_match

from indexd.errors import AuthError
Expand All @@ -20,6 +23,15 @@
blueprint.dist = []


@blueprint.errorhandler(Exception)
def handle_uncaught_exception(err):
import traceback
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment about in-function imports

import sys
print("Uncaught Exception:", file=sys.stderr)
traceback.print_exc(file=sys.stderr)
return flask.jsonify(error="Internal server error"), 500


@blueprint.route("/alias/<path:alias>", methods=["GET"])
def get_alias(alias):
"""
Expand Down Expand Up @@ -60,7 +72,6 @@ def get_record(record):
if not blueprint.dist or "no_dist" in flask.request.args:
raise
ret = dist_get_record(record)

return flask.jsonify(ret), 200


Expand Down Expand Up @@ -111,6 +122,11 @@ def handle_auth_error(err):
return flask.jsonify(error=str(err)), 403


@blueprint.errorhandler(AuthzError)
def handle_authz_error(err):
return flask.jsonify(error=str(err)), 401


@blueprint.errorhandler(AliasNoRecordFound)
def handle_no_record_error(err):
return flask.jsonify(error=str(err)), 404
Expand All @@ -129,3 +145,6 @@ def get_config(setup_state):
blueprint.alias_driver = alias_config["driver"]
if "DIST" in setup_state.app.config:
blueprint.dist = setup_state.app.config["DIST"]
blueprint.rbac = False
if "RBAC" in setup_state.app.config:
blueprint.rbac = setup_state.app.config["RBAC"]
2 changes: 2 additions & 0 deletions indexd/default_settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,8 @@
},
}

CONFIG["RBAC"] = False # RBAC is not enabled by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider a different name for this and a little more description here about how this configuration affects the runtime of the service (some instructions to an operator to understand what True/False really does)

I would maybe recommend ARE_RECORDS_DISCOVERABLE and default to True. Note in a comment that the records themselves contain only file metadata which includes required authorization for underlying files.

I also suspect there is a use case for authorizing the discovery of the records separately from the authorization required for the underlying files. Important to remember that the authz in indexd records currently is intended to represent the authorization required for the underlying files - not the record itself. And I can forsee a potential use of this feature being: no data is discoverable until you "register", then all data is discoverable but you have to apply to specific studies to get access to underlying data.

This solution as it stands is not flexible enough to support the above b/c it couples the authz for the underlying files with the authz to view the indexd record itself. I'm not convinced this is a super future-proof approach.

What we could consider are 2 configs, 1 to turn discovery off and one to toggle whether or not there's a separate authz for all records

ARE_RECORDS_DISCOVERABLE: False

# None below means that each record will be authorized based
# on the authz specified for the underlying files.
# If you set a global discovery authz, this OVERRIDES
# individual record authz for the purpose of discovery
# (e.g. reading the records). Importantly, it DOES NOT
# change any behavior with regards to the authz on the
# record controlling access to underlying data.
GLOBAL_DISCOVERY_AUTHZ: ["/indexd/discovery"]. # or None

I'd like to make sure we support something like this to keep things future proof. So if GLOBAL_DISCOVERY_AUTHZ is set, you ignore the authz on the record and use it instead (for ONLY GET/read records).


AUTH = SQLAlchemyAuthDriver(
"postgresql://postgres:postgres@localhost:5432/indexd_tests" # pragma: allowlist secret
)
Expand Down
27 changes: 27 additions & 0 deletions indexd/dos/blueprint.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
import sys
import traceback

import flask

from indexd.auth import AuthzError
from indexd.blueprint import dist_get_record

from indexd.errors import AuthError
Expand All @@ -15,12 +19,32 @@
blueprint.dist = []


@blueprint.errorhandler(Exception)
def handle_uncaught_exception(err):
print(f"Uncaught Exception: {err}", file=sys.stderr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comments regarding logging. we prefer not to use print statements directly.

traceback.print_exc(file=sys.stderr)
return flask.jsonify(error=f"Internal server error"), 500


@blueprint.errorhandler(AuthzError)
def handle_authz_error(err):
ret = {"msg": str(err), "status_code": 401}
return flask.jsonify(ret), 401


@blueprint.errorhandler(AuthError)
def handle_requester_auth_error(err):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any new public method needs a Google-style docstring

ret = {"msg": str(err), "status_code": 403}
return flask.jsonify(ret), 403


@blueprint.route("/ga4gh/dos/v1/dataobjects/<path:record>", methods=["GET"])
def get_dos_record(record):
"""
Returns a record from the local ids, alias, or global resolvers.
Returns DOS Schema
"""

try:
ret = blueprint.index_driver.get(record)
# record may be a baseID or a DID / GUID. If record is a baseID, ret["did"] is the latest GUID for that record.
Expand Down Expand Up @@ -157,3 +181,6 @@ def get_config(setup_state):
blueprint.alias_driver = alias_config["driver"]
if "DIST" in setup_state.app.config:
blueprint.dist = setup_state.app.config["DIST"]
blueprint.rbac = False
if "RBAC" in setup_state.app.config:
blueprint.rbac = setup_state.app.config["RBAC"]
13 changes: 13 additions & 0 deletions indexd/drs/blueprint.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,15 @@
import os
import re
import sys

import flask
import json
from indexd.errors import AuthError, AuthzError
from indexd.errors import UserError
from indexd.index.errors import NoRecordFound as IndexNoRecordFound
from indexd.errors import IndexdUnexpectedError
from indexd.utils import reverse_url
import traceback
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ensure isort-style imports.

3 import sections:

  • python built-ins
  • third-party
  • within this code

and each one is alphabetically ordered

traceback is a built-in so it should be in the first block of imports


blueprint = flask.Blueprint("drs", __name__)

Expand Down Expand Up @@ -369,6 +372,13 @@ def handle_unexpected_error(err):
return flask.jsonify(ret), err.code


@blueprint.errorhandler(Exception)
def handle_uncaught_exception(err):
print(f"Uncaught Exception: {err}", file=sys.stderr)
traceback.print_exc(file=sys.stderr)
return flask.jsonify(error=f"Internal server error"), 500


@blueprint.record
def get_config(setup_state):
index_config = setup_state.app.config["INDEX"]
Expand All @@ -379,3 +389,6 @@ def get_config(setup_state):
def get_config(setup_state):
if "DRS_SERVICE_INFO" in setup_state.app.config:
blueprint.service_info = setup_state.app.config["DRS_SERVICE_INFO"]
blueprint.rbac = False
if "RBAC" in setup_state.app.config:
blueprint.rbac = setup_state.app.config["RBAC"]
Loading