
--mode all,random-asset #43

Closed
yarikoptic opened this issue Mar 6, 2023 · 16 comments · Fixed by #44
@yarikoptic
Member

Since it currently takes forever to go through the entire collection, I want to make it possible to "sparsely sample" by adding a --mode option to the run, where the default, all, is the current behavior. In the case of random-asset:

  • only one (NWB) asset is chosen from each dandiset
  • testing is run on that asset and the results are added/replaced within assets_ok/assets_nok
    • if the current testing environment's versions differ from the prior record in the "top level" versions, the record would no longer be just a string containing the path, but becomes a record with path and versions fields, e.g.:
path: sub-YutaMouse57/sub-YutaMouse57_ses-YutaMouse57-161016_behavior+ecephys.nwb
versions:
 - hdmf: 3.4.7
 - matnwb: v2.5.0.0
 - pynwb: 2.2.0
  • while saving, analyze: if all paths have identical versions (for those that have only a path, take the "top level" versions), just place those identical versions into the top-level versions: record, and renormalize all paths back to plain string records.
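The normalization step described in the last bullet could be sketched roughly like this (the record layout and helper name are hypothetical, modeled on the YAML example above):

```python
def normalize(top_versions, records):
    """If every per-asset record carries the same versions (records that are
    plain path strings inherit the top-level versions), hoist those versions
    to the top level and collapse all records back to plain path strings."""
    expanded = [
        r if isinstance(r, dict) else {"path": r, "versions": top_versions}
        for r in records
    ]
    all_versions = {tuple(sorted(r["versions"].items())) for r in expanded}
    if len(all_versions) == 1:
        common = dict(next(iter(all_versions)))
        return common, [r["path"] for r in expanded]
    return top_versions, records

top = {"hdmf": "3.4.7", "matnwb": "v2.5.0.0", "pynwb": "2.2.0"}
recs = ["sub-A/a.nwb", {"path": "sub-B/b.nwb", "versions": dict(top)}]
new_top, new_recs = normalize(top, recs)
```

When the versions disagree, the function leaves everything as-is, so it is safe to call on every save.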

While RF for this,

  • extend the test_files command (which runs tests on given file(s)) with a --save-results flag, so it would store the results as well (currently they are just printed). This would allow for selective rerunning.
  • optional: to avoid the need for locking in case of parallel execution of check in some mode and, e.g., test_files -- refactor the code so that storing results from each file consists of reloading the results file, modifying the entry for that path, and saving it back. Care would need to be taken to support removal of stale results, i.e. for paths which were already removed from dandisets, so this might need a recheck that all files are still available as symlinks or present.
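The reload-modify-save cycle could look roughly like this (a sketch with hypothetical names; JSON stands in for the real status.yaml to keep the example self-contained):

```python
import json
import os
import tempfile


def save_result(status_file, path, result):
    """Reload the results file, update the entry for one asset path, and
    write it back atomically (write to a temp file, then rename), so that
    concurrent writers at worst lose one update rather than corrupt the
    file.  JSON is used here for a self-contained sketch; the real file
    would be status.yaml."""
    try:
        with open(status_file) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {"assets_ok": [], "assets_nok": []}
    # Drop the path from both lists, then re-add it under its new status.
    for key in ("assets_ok", "assets_nok"):
        data[key] = [p for p in data[key] if p != path]
    data["assets_ok" if result == "ok" else "assets_nok"].append(path)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(status_file) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, status_file)  # atomic rename on POSIX
```

The atomic rename means a reader never sees a half-written file; true last-writer-wins semantics would still need a lock, but for one-asset updates a lost update is recoverable by rerunning.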
@yarikoptic
Member Author

another mode could be random-outdated-asset, which would pick a random asset (among ok/nok) whose recorded versions are not current. Then we could sweep with that option, slowly getting to the most recent state without waiting for a long single sweep to finish.

@yarikoptic
Member Author

I like random-outdated-asset; make it random-outdated-asset-first, which would do random-outdated-asset, but if no asset is outdated -- do random-asset. That is pretty much the only extra mode I would like to see ATM, but I wouldn't mind all three of them. This way we can just keep it running forever, slowly going through all assets.
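The fallback behavior could be sketched as follows (hypothetical helper, assuming each asset record carries a versions field as in the proposal above):

```python
import random


def pick_asset(assets, current_versions, mode):
    """Pick which asset to test next.  An asset counts as 'outdated' when
    the versions recorded for it differ from the current environment."""
    if mode == "random-outdated-asset-first":
        outdated = [a for a in assets if a["versions"] != current_versions]
        # Fall back to plain random-asset when nothing is outdated.
        pool = outdated or assets
        return random.choice(pool)
    if mode == "random-asset":
        return random.choice(assets)
    raise ValueError(f"unknown mode: {mode}")
```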

@jwodder
Member

jwodder commented Mar 21, 2023

@yarikoptic To be clear, do you want random-outdated-asset, random-outdated-asset-first, or both?

@jwodder jwodder mentioned this issue Mar 21, 2023
@yarikoptic
Member Author

I think we can be fine with only random-outdated-asset-first ATM, in addition to all; after the refactoring it should be easy to add any other random-* mode if really needed.

@jwodder
Member

jwodder commented Mar 22, 2023

@yarikoptic

extend test_files (which run tests on a given file(s)) command with --save-results flag, so it would store the results as well

This would require the test_files command to know what Dandisets the assets belong to in order to save the results in the proper files. How should that work?

@yarikoptic
Member Author

for each path, it can discover the dandiset by locating dandiset.yaml among the path's parents.
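A minimal sketch of that discovery, assuming only that a Dandiset root is marked by a dandiset.yaml file (function name is hypothetical):

```python
from pathlib import Path


def find_dandiset_root(asset_path):
    """Walk up from an asset file looking for dandiset.yaml; its directory
    is the Dandiset whose status file should receive the results."""
    for parent in Path(asset_path).resolve().parents:
        if (parent / "dandiset.yaml").exists():
            return parent
    raise ValueError(f"{asset_path} is not inside a Dandiset")
```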

@jwodder
Member

jwodder commented Mar 22, 2023

@yarikoptic Should running test_files on files that aren't in a Dandiset be supported?

@yarikoptic
Member Author

Ideally yes; saving concerns only the --save-results mode of operation.
BTW, it is ok to make --save-results STATUS_FILE_PATH take a path to the status.yaml file where results should be updated/saved, so the code doesn't even need to figure out where to save them. It would be explicit, and if the file is not there -- just error out rather than bother with initializing the full thing.

@jwodder
Member

jwodder commented Mar 24, 2023

@yarikoptic Problem: When saving test results obtained with test-files, the script needs to know the various package versions so that they can be properly recorded in status.yaml. However, the MatNWB version is currently only known if MatNWB was downloaded & unpacked, which is only done if test-files was asked to test against MatNWB, as otherwise it's a waste of time. Leaving the MatNWB version empty when it's not known will lead to difficulties with managing and "normalizing" status.yaml-wide vs. per-asset versions.

One option to determine the MatNWB version when testing against pynwb would be to just query GitHub for the tag of MatNWB's latest release and not download any code, but this could lead to status.yaml files being normalized less often than possible if all recorded versions list the same older MatNWB version, in which case we would then record a newer version for just the tested asset(s) that wasn't even used for the test.
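That querying option could be a single unauthenticated call to GitHub's "latest release" API endpoint. A sketch, with the fetcher injected so it can be stubbed in tests (function names are hypothetical):

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    # Production fetcher: one unauthenticated GitHub API request
    # (subject to GitHub's rate limits).
    with urlopen(url) as r:
        return json.load(r)


def latest_release_tag(fetcher=fetch_json,
                       repo="NeurodataWithoutBorders/matnwb"):
    """Tag of the repo's latest GitHub release, without downloading code."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    return fetcher(url)["tag_name"]
```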

@yarikoptic
Member Author

I asked ChatGPT to give me a function which would run git describe --tags in the directory where a function is defined; it gave me:

function tag = getGitTagForFunction(funcName)
% This function returns the output of "git describe --tags" command for
% the directory containing the file defining the input function.

% Get the path to the function file
funcFile = which(funcName);

% Get the path to the directory containing the function file
funcDir = fileparts(funcFile);

% Run the "git describe --tags" command in the function directory
[status, result] = system(sprintf('cd "%s" && git describe --tags', funcDir));

% If the command ran successfully, parse the output to get the tag
if status == 0
    tag = strtrim(result);
else
    error('Error running "git describe --tags" command: %s', result);
end
end

which I tested to work on drogon:

(base) dandi@drogon:/tmp$ MATLABPATH=/tmp/matnwb/clone/:/tmp/mymat matlab -nodesktop -batch "disp(getGitTagForFunction('nwbRead'))"
v2.6.0.1-2-gda49922

so if we switch to using a clone of the matnwb repo -- it should work.

As for the master export -- I do not see how they could have made it provide a version based on the tags; we also don't in those projects which use versioneer or your versioningit, right?

@jwodder
Member

jwodder commented Mar 27, 2023

@yarikoptic Why should we switch to using a clone of the matnwb repo? That would mean we'd frequently be using unreleased code, which seems like not the best idea. It'd also increase the number of different matnwb versions listed in status.yaml files, hindering normalization.

Moreover, if we're using a clone of the matnwb repo, what's the point of running a matlab script to get the git describe output instead of just running the command in the clone?

Honestly, I prefer my suggested solution of querying GitHub for the latest matnwb version whenever matnwb isn't downloaded.

@yarikoptic
Member Author

To tell the truth, I forgot that we are using a released version ATM, but I remain of the opinion that a clone would be a better choice: I don't like querying because it relies heavily on the assumption that whatever we installed elsewhere has that version.

  • with the current run time of weeks, the version we actually use might not be the one queried later
  • it should be easy to tune install_matnwb() to clone and still check out that recent release. As a result we would gain the ability to record exactly the version we are actually using instead of the current (possibly already different) release
  • for "Make it possible to test in 'development' scenario" #20 we would need to test against a different (master) version. It would be trivial to parametrize that function to stay at current master and again keep consistent, discoverable version information
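Such a tuned-up install could be sketched like this (a hypothetical variant of install_matnwb(); the url parameter is exposed so the clone source can be swapped, and the "latest release" is approximated by version-sorting tags):

```python
import subprocess


def install_matnwb_clone(target="matnwb",
                         url="https://github.com/NeurodataWithoutBorders/matnwb",
                         ref=None):
    """Clone the matnwb repo and check out a ref (defaulting to the newest
    tag), so the version in use stays discoverable via `git describe`."""
    subprocess.run(["git", "clone", url, target], check=True)
    if ref is None:
        # Newest tag by version-sort order stands in for the latest release.
        tags = subprocess.run(
            ["git", "-C", target, "tag", "--sort=-version:refname"],
            check=True, capture_output=True, text=True,
        ).stdout.split()
        ref = tags[0]
    subprocess.run(["git", "-C", target, "checkout", ref], check=True)
    # Report the version actually checked out.
    return subprocess.run(
        ["git", "-C", target, "describe", "--tags"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
```

Passing ref="master" (or any branch) would cover the #20 development scenario with the same code path.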

@jwodder
Member

jwodder commented Mar 27, 2023

@yarikoptic So, if we switch to cloning, exactly when should the script update the clone to use the latest release?

@yarikoptic
Member Author

ATM we have

if MATNWB_INSTALL_DIR.exists():
to completely skip installing if the target folder exists. We trigger install_matnwb at the beginning of the sweep (check) and in test_files (if the test name includes matnwb), so it seems that ATM we do not upgrade at all if an installation already exists. For the purposes of this issue it could be just that: no updates etc.

For #20 (comment) we would anyway create some kind of scenarios to switch between and decide when to update, which would likely be a separate step.

@jwodder
Member

jwodder commented Mar 27, 2023

@yarikoptic That's not what that line does; if the matnwb folder already exists, it is deleted, and then (regardless of whether it existed or not) the latest release is downloaded & installed.

@yarikoptic
Member Author

doh -- how could I misread rmtree ;) well -- FWIW we can just keep doing the same for now anyway. My point is that the update aspect is orthogonal to whether you get it from a tarball or a git clone, but with a git clone we can discover the version in use.

yarikoptic added a commit that referenced this issue Mar 29, 2023