Raised by @satra in emails: we would need a way to get data access stats.
We have a bucket with logs for S3 access, and IMHO it would be the ultimate source of stats, since we share not only dandi-api URLs but also direct S3 URLs. Also, we do not store/keep all versions of data in draft, so neither girder nor the dandi-api DB would have all that information.
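As a rough illustration of what mining those logs could look like, here is a minimal sketch assuming the standard S3 server access log format; `OUR_IPS` is a hypothetical placeholder for the addresses of our own backup/mirror hosts, not a settled list:

```python
import re
from collections import Counter

# S3 server access logs are space-delimited, with a bracketed timestamp and
# quoted request-URI/user-agent; we only capture the leading fields we need.
LOG_LINE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
)

OUR_IPS = {"203.0.113.10"}  # hypothetical: our own backup/cron hosts


def access_counts(lines):
    """Count GET-object accesses per S3 key, skipping our own traffic."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        if m["ip"] in OUR_IPS:
            continue  # filter out our own access (backups etc.)
        if m["operation"].startswith("REST.GET.OBJECT"):
            counts[m["key"]] += 1
    return counts
```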
So I guess we should resort to using datalad dandisets and the information (URLs) stored in git-annex history. Since we only update on 'cron', that information could also be incomplete, but AFAIK it would be the best we can get ATM. In the longer run we might want to establish a more complete way to track this.
We can try to establish access stats from those S3 access logs, ideally filtering out our own access (for backups etc.), probably based on our IP(s). For that we would need to:

- sweep through all assets of all dandisets (e.g. `git annex whereis --json --all` or alike)
- use the URLs in the returned values to map S3 paths to dandisets (ATM likely to be unique, i.e. blob:dandiset, but might already be violated for some tiny files); see the sketch after this list
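A minimal sketch of that sweep, assuming `git annex whereis --json --all` emits one JSON record per key whose `whereis` entries carry registered `urls`; the `s3_key_to_dandiset` helper and its netloc check are illustrative assumptions, not a settled design:

```python
import json
import subprocess
from urllib.parse import urlparse


def s3_key_to_dandiset(repo_path, dandiset_id):
    """Map S3 object keys known to one dandiset's git-annex repo to its ID."""
    out = subprocess.run(
        ["git", "annex", "whereis", "--json", "--all"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    mapping = {}
    for line in out.splitlines():
        record = json.loads(line)
        for remote in record.get("whereis", []):
            for url in remote.get("urls", []):
                parsed = urlparse(url)
                if "s3" in parsed.netloc:  # keep only direct S3 URLs
                    mapping[parsed.path.lstrip("/")] = dandiset_id
    return mapping
```

Running this across all datalad dandisets and merging the per-dandiset mappings would give the S3-key-to-dandiset table needed to attribute the log counts above.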
The problem would eventually be that we could not uniquely map from a blob's UUID to a specific dandiset once we start creating meta-dandisets. Then S3 logs would not be sufficient, and ATM I do not see any way we could really disambiguate (e.g. if data is accessed just by a direct S3 URL). We could add logic, though, to assume the 'earliest' (lowest dandiset ID, or earliest commit date across dandisets) to be the origin of a file, or explicitly record in the DB for which dandiset every new blob was originally added.
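For the 'earliest' heuristic, a minimal sketch; the `candidates` structure and `earliest_origin` helper are hypothetical, it just encodes "earliest commit date, then lowest dandiset ID":

```python
def earliest_origin(candidates):
    """candidates: blob key -> list of (dandiset_id, first_commit_date) pairs.

    Resolve ambiguity by taking the earliest commit date, falling back to
    the lowest dandiset ID on ties -- the heuristic proposed above.
    """
    return {
        blob: min(pairs, key=lambda p: (p[1], p[0]))[0]
        for blob, pairs in candidates.items()
    }

# e.g. a blob shared by 000005 (added 2020-01-02) and 000123 (added
# 2021-03-04) resolves to 000005:
assert earliest_origin(
    {"blobs/ab/cd": [("000123", "2021-03-04"), ("000005", "2020-01-02")]}
) == {"blobs/ab/cd": "000005"}
```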