Raised by @satra in emails: we would need a way to get data access stats.
We have a bucket with logs for S3 access, and IMHO it would be the ultimate source of stats, since we share not only dandi-api URLs but also direct S3 URLs. Also, we do not store/keep all versions of data in draft, so neither girder nor the dandi-api DB would have all that information.
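As a rough illustration of what mining those logs could look like, here is a minimal sketch assuming the standard S3 server access log format; `OUR_IPS` is a hypothetical placeholder for the addresses of our own backup/mirror hosts, not a settled list:

```python
import re
from collections import Counter

# S3 server access logs are space-delimited, with a bracketed timestamp and
# quoted request-URI/user-agent; we only capture the leading fields we need.
LOG_LINE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
)

OUR_IPS = {"203.0.113.10"}  # hypothetical: our own backup/cron hosts


def access_counts(lines):
    """Count GET-object accesses per S3 key, skipping our own traffic."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        if m["ip"] in OUR_IPS:
            continue  # filter out our own access (backups etc.)
        if m["operation"].startswith("REST.GET.OBJECT"):
            counts[m["key"]] += 1
    return counts
```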
So I guess we should resort to using datalad dandisets and the information (URLs) stored in git-annex history. Since we only update on 'cron', that information could also be incomplete, but AFAIK it would be the best we can get ATM. In the longer run we might want to establish a more complete way to track this.
We can try to establish access stats from those S3 access logs, ideally filtering out our own access (for backups etc.), probably based on our IP(s). For that we would need to:

- sweep through all assets of all dandisets (e.g. `git annex whereis --json --all` or alike)
- use the URLs in the returned values to map S3 paths to dandisets (ATM likely to be unique, i.e. blob:dandiset, but might already be violated for some tiny files); see the sketch after this list
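A minimal sketch of that sweep, assuming `git annex whereis --json --all` emits one JSON record per key whose `whereis` entries carry registered `urls`; the `s3_key_to_dandiset` helper and its netloc check are illustrative assumptions, not a settled design:

```python
import json
import subprocess
from urllib.parse import urlparse


def s3_key_to_dandiset(repo_path, dandiset_id):
    """Map S3 object keys known to one dandiset's git-annex repo to its ID."""
    out = subprocess.run(
        ["git", "annex", "whereis", "--json", "--all"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    mapping = {}
    for line in out.splitlines():
        record = json.loads(line)
        for remote in record.get("whereis", []):
            for url in remote.get("urls", []):
                parsed = urlparse(url)
                if "s3" in parsed.netloc:  # keep only direct S3 URLs
                    mapping[parsed.path.lstrip("/")] = dandiset_id
    return mapping
```

Running this across all datalad dandisets and merging the per-dandiset mappings would give the S3-key-to-dandiset table needed to attribute the log counts above.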
The problem would eventually be that we could not uniquely map from a blob's UUID to a specific dandiset once we start creating meta-dandisets. Then S3 logs would not be sufficient, and ATM I do not see any way we could really disambiguate (e.g. if data is accessed just by a direct S3 URL). We could add logic, though, to assume the 'earliest' (lowest dandiset ID, or earliest commit date across dandisets) to be the origin of a file, or explicitly record in the DB for which dandiset every new blob was originally added.
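For the 'earliest' heuristic, a minimal sketch; the `candidates` structure and `earliest_origin` helper are hypothetical, it just encodes "earliest commit date, then lowest dandiset ID":

```python
def earliest_origin(candidates):
    """candidates: blob key -> list of (dandiset_id, first_commit_date) pairs.

    Resolve ambiguity by taking the earliest commit date, falling back to
    the lowest dandiset ID on ties -- the heuristic proposed above.
    """
    return {
        blob: min(pairs, key=lambda p: (p[1], p[0]))[0]
        for blob, pairs in candidates.items()
    }

# e.g. a blob shared by 000005 (added 2020-01-02) and 000123 (added
# 2021-03-04) resolves to 000005:
assert earliest_origin(
    {"blobs/ab/cd": [("000123", "2021-03-04"), ("000005", "2020-01-02")]}
) == {"blobs/ab/cd": "000005"}
```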