
fastadi submission #3

Open
mpadge opened this issue Oct 6, 2020 · 13 comments
mpadge commented Oct 6, 2020

Thanks @alexpghayes for volunteering to submit your fastadi package for a trial "soft submission" to rOpenSci's new system for peer review of statistical software. This issue includes output from the automated assessment and reporting tools developed as part of this new system. These currently run as the following two distinct components, which will be integrated later:

packgraph

library (packgraph)
packageVersion ("packgraph")
#> [1] '0.0.0.8'
package <- "/<local>/<path>/<to>/fastadi"

g <- pg_graph (package, plot = FALSE)
pg_report (g)
#> ══ fastadi ═════════════════════════════════════════════════════════════════════
#> 
#> The package has 4 exported functions, and 35 non-exported funtions. The
#> exported functions are structured into the following 1 primary cluster
#> containing 2 functions
#> 
#> 
#> | cluster|  n|name                | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
#> |-------:|--:|:-------------------|----------:|-------------:|-------------:|-----------------:|----------:|
#> |       1|  1|adaptive_imputation |          5|           140|            18|                 0|          5|
#> |       1|  2|adaptive_initialize |          4|           232|             0|                 7|         NA|
#> 
#> There are also 2 isolated functions:
#> 
#> 
#> |  n|name            | loc|
#> |--:|:---------------|---:|
#> |  1|adaptive_impute |  19|
#> |  2|citation_impute |  19|
#> 
#> ── Documentation of non-exported functions ─────────────────────────────────────
#> 
#> 
#> |value  | doclines| cmtlines|
#> |:------|--------:|--------:|
#> |mean   |      4.9|      1.5|
#> |median |      0.0|      0.0|

autotest

library (autotest)
packageVersion ("autotest")
#> [1] '0.0.1.250'
x <- autotest_package (package)
#> Loading fastadi
#> Loading required package: Matrix
#> Loading required package: LRMF3
#> ✔ [1 / 4]: adaptive_impute
#> ✔ [2 / 4]: adaptive_impute
#> ✔ [3 / 4]: adaptive_initialize
#> ✔ [4 / 4]: citation_impute
x$yaml_hash <- substring (x$yaml_hash, 1, 6)
knitr::kable (x, row.names = TRUE)
type fn_name parameter operation content yaml_hash
1 diagnostic adaptive_impute rank ascertain integer range Parameter [rank] responds to integer values in [0, 22] c23cbd
2 diagnostic adaptive_impute rank match integer range to documentation Parameter range for rank is NOT documented c23cbd
3 diagnostic adaptive_impute max_iter ascertain integer range Parameter [max_iter] permits unrestricted integer inputs c23cbd
4 diagnostic adaptive_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 c23cbd
5 diagnostic adaptive_impute check_interval ascertain integer range Parameter [check_interval] permits unrestricted integer inputs c23cbd
6 diagnostic adaptive_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 c23cbd
7 diagnostic adaptive_impute rank ascertain integer range Parameter [rank] responds to integer values in [0, 22] 747b85
8 diagnostic adaptive_impute rank match integer range to documentation Parameter range for rank is NOT documented 747b85
9 diagnostic adaptive_impute max_iter ascertain integer range Parameter [max_iter] permits unrestricted integer inputs 747b85
10 diagnostic adaptive_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 747b85
11 diagnostic adaptive_impute initialization upper case character parameter Parameter initialization of function [adaptive_impute] is assumed to a single character, but is case dependent 747b85
12 diagnostic adaptive_impute initialization length 2 vector for single-length parameter parameter [initialization] is assumed to be a single value of character type, yet admits vectors of length > 1 747b85
13 diagnostic adaptive_impute check_interval ascertain integer range Parameter [check_interval] permits unrestricted integer inputs 747b85
14 diagnostic adaptive_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 747b85
15 diagnostic adaptive_initialize rank ascertain integer range Parameter [rank] responds to integer values in [0, 215] ee7464
16 diagnostic adaptive_initialize rank match integer range to documentation Parameter range for rank is NOT documented ee7464
17 diagnostic citation_impute rank ascertain integer range Parameter [rank] responds to integer values in [0, 28] 2cd207
18 diagnostic citation_impute rank match integer range to documentation Parameter range for rank is NOT documented 2cd207
19 diagnostic citation_impute max_iter ascertain integer range Parameter [max_iter] permits unrestricted integer inputs 2cd207
20 diagnostic citation_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 2cd207
21 diagnostic citation_impute check_interval ascertain integer range Parameter [check_interval] permits unrestricted integer inputs 2cd207
22 diagnostic citation_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 2cd207
23 diagnostic citation_impute initialization upper case character parameter Parameter initialization of function [citation_impute] is assumed to a single character, but is case dependent 2cd207
24 diagnostic citation_impute initialization length 2 vector for single-length parameter parameter [initialization] is assumed to be a single value of character type, yet admits vectors of length > 1 2cd207
25 error adaptive_imputation NA This function has no documented example NA
summary (x)
#> autotesting package [fastadi, v0.0.0.9007] generated 25 rows of output of the following types:
#>      1 error
#>      0 warnings
#>      0 messages
#>      24 other diagnosticss
#> That corresponds to 8 messages per documented function (which has examples)
#> 
#>               fn_name num_errors num_warnings num_messages num_diagnostics
#> 1 adaptive_imputation          1           NA           NA              NA
#> 2     adaptive_impute         NA           NA           NA              14
#> 3 adaptive_initialize         NA           NA           NA               2
#> 4     citation_impute         NA           NA           NA               8
#> 
#> In addition to the values in that table, the output includes 1 functions which have no documented examples:
#>     1. adaptive_imputation
#> 
#>     git hash for package as analysed here:
#>     [26f7c132a4068a3a105f5518aad2692b87e4ddb1]

Created on 2020-10-06 by the reprex package (v0.3.0)

Most of these diagnostic messages concern the admissible ranges of single-valued parameters, and can probably be safely ignored, or perhaps taken to indicate a minor need to tweak the parameter documentation to state expected or admitted ranges. The parsing of description entries by autotest to estimate stated ranges is currently fairly crude, so documentation updates on your side to address these could provide useful test cases for refining those procedures.

Those aside, the only issues are generally a lack of control over the lengths of parameters presumed to be single-valued, and matching of character arguments regardless of case. If you could please ping here once you've addressed those, we'll post an updated autotest output and proceed to the subsequent stages of the review process. Thanks!
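One low-cost way to state such ranges is directly in the roxygen2 parameter documentation, which would also give autotest's range-parsing something concrete to work with. A sketch only; these descriptions are illustrative, not fastadi's actual docs:

```r
#' @param rank A single integer giving the rank of the low-rank
#'   approximation; must lie in the range [1, min(dim(X))].
#' @param max_iter A single positive integer giving the maximum number
#'   of iterations, expected to lie in the range [1, 1e5].
```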


Further information on autotest output

The output of autotest includes a column yaml_hash. This in turn refers to the yaml specification used to generate the autotests, which can be generated locally by running examples_to_yaml (<path>/<to>/<package>). Those contain the yaml_hash values, and finding the matching value will show you the base code used to trigger the diagnostic messages. The operation column should then provide a sufficient description of what has been mutated with regard to the structure defined in the yaml.
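That lookup might look like the following sketch (the path is a placeholder for a local clone of the package):

```r
library (autotest)
# generate the yaml specifications underlying all autotests:
yamls <- examples_to_yaml ("/<local>/<path>/<to>/fastadi")
# inspect individual specifications to find the one whose hash
# matches the yaml_hash column of the autotest output:
print (yamls [[1]])
```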


alexpghayes commented Oct 6, 2020

The autotest results have been addressed -- there are some other ongoing build failures related to vignettes and a convergence criterion. I believe things should be in a state where reviewers can give this a look, but I am also unable to replicate the convergence issues locally on my computer, so reviewers might run into them immediately; just FYI.

Update: I've dealt with both of these issues.


mpadge commented Oct 7, 2020

Thanks @alexpghayes. I still see some remaining autotest output, condensed here for brevity to what I deem the relevant items. (The remainder are primarily diagnostic messages about ranges for parameters like rank and max_iter; admissible ranges exist in practice, yet the documentation gives no indication of any, so adding some helpful advice on likely or expected ranges for these parameters would be useful.)

type fn_name parameter operation content yaml_hash
5 diagnostic adaptive_impute rank length 2 vector for single-length parameter parameter [rank] is assumed to be a single value of integer type, yet admits vectors of length > 1 c23cbd
8 diagnostic adaptive_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 c23cbd
11 diagnostic adaptive_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 c23cbd
16 diagnostic adaptive_impute rank length 2 vector for single-length parameter parameter [rank] is assumed to be a single value of integer type, yet admits vectors of length > 1 747b85
19 diagnostic adaptive_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 747b85
20 diagnostic adaptive_impute initialization lower case character parameter Parameter initialization of function [adaptive_impute] is assumed to a single character, but is case dependent 747b85
21 diagnostic adaptive_impute initialization upper case character parameter Parameter initialization of function [adaptive_impute] is assumed to a single character, but is case dependent 747b85
22 diagnostic adaptive_impute initialization length 2 vector for single-length parameter parameter [initialization] is assumed to be a single value of character type, yet admits vectors of length > 1 747b85
25 diagnostic adaptive_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 747b85
32 diagnostic citation_impute rank length 2 vector for single-length parameter parameter [rank] is assumed to be a single value of integer type, yet admits vectors of length > 1 2cd207
35 diagnostic citation_impute max_iter length 2 vector for single-length parameter parameter [max_iter] is assumed to be a single value of integer type, yet admits vectors of length > 1 2cd207
38 diagnostic citation_impute check_interval length 2 vector for single-length parameter parameter [check_interval] is assumed to be a single value of integer type, yet admits vectors of length > 1 2cd207
39 diagnostic citation_impute initialization lower case character parameter Parameter initialization of function [citation_impute] is assumed to a single character, but is case dependent 2cd207
40 diagnostic citation_impute initialization upper case character parameter Parameter initialization of function [citation_impute] is assumed to a single character, but is case dependent 2cd207
41 diagnostic citation_impute initialization length 2 vector for single-length parameter parameter [initialization] is assumed to be a single value of character type, yet admits vectors of length > 1 2cd207
summary (x)
#> autotesting package [fastadi, v0.0.0.9011] generated 41 rows of output of the following types:
#>      0 errors
#>      3 warnings
#>      3 messages
#>      35 other diagnosticss
#> That corresponds to 13.667 messages per documented function (which has examples)
#> 
#>               fn_name num_errors num_warnings num_messages num_diagnostics
#> 1     adaptive_impute         NA            2            2              21
#> 2 adaptive_initialize         NA           NA           NA               2
#> 3     citation_impute         NA            1            1              12
#> 
#>     git hash for package as analysed here:
#>     [27e052cc42d6a7e618eb111e31ba97d14ae7d59c]

Created on 2020-10-07 by the reprex package (v0.3.0)

@alexpghayes

All of these are addressed in my commit from above, but perhaps not in ways that autotest is aware of?


mpadge commented Oct 7, 2020

Oh great, that'll give me a chance to compare currently coded autotest expectations with your actual implementation. Will report back...


mpadge commented Oct 8, 2020

Reporting back: It was a bug in autotest. Everything now checks clear except that the parameter initialization of citation_impute() is case-dependent. That's obviously very superficial, so up to you whether you want to address it or not. Next steps will appear here very soon. I'll also delete the autotest output from above, along with the previous couple of comments, if that's okay with you?

@alexpghayes

That's a little weird, since citation_impute() is essentially a copy-paste of adaptive_impute() with slightly different computations but equivalent error checking. I'm happy to ignore the case matching; I think that's a pretty low priority given that I already use match.arg().
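For what it's worth, base R's match.arg() is itself case-sensitive, which is presumably why autotest still flags this. A minimal sketch of the difference, with illustrative (not fastadi's actual) argument values:

```r
f <- function (initialization = c ("svd", "random")) {
    match.arg (initialization)
}
f ("svd")    # "svd"
# f ("SVD")  # error: 'arg' should be one of "svd", "random"

# a case-insensitive variant lower-cases the input first:
g <- function (initialization = c ("svd", "random")) {
    match.arg (tolower (initialization), c ("svd", "random"))
}
g ("SVD")    # "svd"
```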


mpadge commented Oct 16, 2020

Thanks @alexpghayes for your submission and improvements that have already been made to the package. Don't worry about the remaining autotest output. Most of it can be ignored in this case, and the few things that can't are ultimately going to be optional (like case-matching). The testing regime is currently hard-coded, but will soon allow user-specified regimes to be applied. We also plan to implement a system to record written explanations for any categories of tests you choose to disable, with your justifications ultimately incorporated within initial package submission/pre-review templates. None of that is yet possible, so we're happy to just move on, but will likely re-visit with autotest at a later stage, for which we have all the relevant git hash information.

Given that, we would now like to proceed to the formal review stage, for which members of the project's advisory board @bbolker and @topepo have kindly agreed to review your package. They are now requested to perform a two-stage review, the first part involving assessment of the package against the standards as they are currently drafted, with the second being a more "traditional" review. We hope, by the time we proceed to this second component, that many aspects which might otherwise have arisen within a "traditional" unstructured review will already have been covered, and will thereby make the review process notably easier.

Our review system will ultimately perform much of the preceding automated assessment prior to actual submission, and reviewers will be provided with a high-level interactive graphical overview of the package's functions and their inter-relationships. Until the system reaches that point, reviewers can clone Alex's repo from github.com/RoheLab/fastadi, then run the following three lines in the fastadi directory:

remotes::install_github("mpadge/packgraph")
library(packgraph)
pg_graph(".")

That should give you an interactive version something like this:

[screenshot: interactive packgraph visualisation of fastadi's function call graph]


Instructions for review

@bbolker and @topepo, could you please now assess the fastadi package with respect to the current General Standards for Statistical Software, and the category-specific standards for Unsupervised Learning Software.

Please do this in two phases:

  • Review the package as it stood at the time of initial assessment (git hash 26f7c1, as given above).
  • Having done that, check out the package in its current state and note any differences, along with the git hash at that stage.

In each case, please only note those standards which you judge the package not to conform to, along with a description of what you would expect this particular software package to do in order to conform to each standard. When you do that, please provide sufficient information on which standard you are referring to. (The standards themselves are all enumerated, but not yet at a necessarily stable state, so please provide enough information for anyone to clearly know which standard you are referring to regardless of potential changes in nomenclature.) Please also note as a separate list all those standards which you think should not apply to this package, along with brief explanations of why.

Importantly, to aid us in refining the standards which will ultimately guide the peer review of statistical software, we also ask you to please consider whether you perceive any aspects of software (design, functionality, algorithmic implementations or applications, testing, and any other aspects you can think of) which you think might be able to be addressed by standards, and yet which are not addressed by our standards in their current form.

In particular, we note that the nominated category "Dimensionality Reduction, Clustering, and Unsupervised Learning" only partially describes @alexpghayes's package, notably because our categories effectively aim to describe the general aims of software, whereas in this case that category applies to much of the methodology, while the actual aim remains arguably beyond that scope. We will therefore be particularly interested in hearing your thoughts on the applicability or otherwise of the category-specific standards in this case.

To sum up, please post the following in this issue:

  1. Your two itemized lists of standards which the software initially failed to meet, along with an indication of which of that subset of standards the most recent version now meets. Please provide git hashes for the repository head in each case.
  2. An indication of any aspects of the software which you think are not addressed by the current standards yet could (potentially) be.

Once you've done that, we'll ask you to proceed to a more general review of the software, for which we'll provide more detail at that time. Thanks all for agreeing to be part of this!

Due date

We would like to have this review phase completed within 4 weeks, so by the 13th of November 2020. We accordingly suggest that you aim to have the first of the two tasks completed within two weeks, by the 30th of October.

Could you both please also record approximately how much time you have spent on each review stage. Thank you!


mpadge commented Oct 20, 2020

Update for reviewers @bbolker and @topepo, note that this repo now includes an R package which enables you to get a pre-formatted checklist for your reviews (inspired by, and with gratitude to, co-board member @stephaniehicks) by running the following lines:

remotes::install_github("ropenscilabs/statistical-software-review")
library(statsoftrev) # the name of the package
rssr_standards_checklist (category = "unsupervised")

That will produce a markdown-formatted checklist in your clipboard ready to paste where you like, or you can use a filename parameter to specify a local file.

ping @noamross so you'll be notified of these conversations.


bbolker commented Oct 20, 2020

OK, will do. Note that I had to use githubinstall::githubinstall("pkgapi") (this package searches GitHub for relevant repos rather than requiring you to know the repo owner in advance) in order to install pkgapi: remotes::install_github() doesn't know how to get arbitrary dependencies from GitHub, only from CRAN.


bbolker commented Oct 20, 2020

Also, there's a typo ("funtions") in the packgraph messages shown in the first comment above (sorry).


bbolker commented Dec 1, 2020

General Standards

Documentation

  • G1.0 Statistical Software should list at least one primary reference from published academic literature.

In vignettes/README/DESCRIPTION only; could be more prominent?

Statistical Terminology

  • [ x ] G1.1 All statistical terminology should be clarified and unambiguously defined.

Not sure about this. Uses standard definitions (the bulk of this package is mathematical and algorithmic rather than statistical)

Function-level Documentation

  • G1.2 Software should use roxygen2 to document all functions.
  • G1.2a All internal (non-exported) functions should also be documented in standard roxygen2 format, along with a final @noRd tag to suppress automatic generation of .Rd files.

Supplementary Documentation

  • [ NA ] G1.3 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.

No associated pubs [yet? The vignette looks like a ms in early stages of prep; the README says "the vignettes are currently scratch work for reference by the developers and are not yet ready for general consumption."]

  • G1.4 Software should include code necessary to compare performance claims with alternative implementations in other R packages.

[README says "In simulations fastadi often outperforms softImpute by a small margin.", but I don't know if/where this code lives.]

Input Structures

Uni-variate (Vector) Input

  • [ x ] G2.0 Implement assertions on lengths of inputs, particularly through asserting that inputs expected to be single- or multi-valued are indeed so.
  • G2.0a Provide explicit secondary documentation of any expectations on lengths of inputs
  • G2.1 Implement assertions on types of inputs (see the initial point on nomenclature above).
    generally yes, but e.g. adaptive_impute(..., max_iter = 20.5) runs without error; so does adaptive_impute(..., additional = c(20, 25))
  • [ x ] G2.1a Provide explicit secondary documentation of expectations on data types of all vector inputs.
  • [ ? ] G2.2 Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.
  • G2.3 For univariate character input:
  • [ x ] G2.3a Use match.arg() or equivalent where applicable to only permit expected values.
  • G2.3b Either: use tolower() or equivalent to ensure input of character parameters is not case dependent; or explicitly document that parameters are strictly case-sensitive.
  • [NA] G2.4 Provide appropriate mechanisms to convert between different data types, potentially including:
  • G2.4a explicit conversion to integer via as.integer()
  • G2.4b explicit conversion to continuous via as.numeric()
  • G2.4c explicit conversion to character via as.character() (and not paste or paste0)
  • G2.4d explicit conversion to factor via as.factor()
  • G2.4e explicit conversion from factor via as...() functions
  • G2.5 Where inputs are expected to be of factor type, secondary documentation should explicitly state whether these should be ordered or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.
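
A minimal assertion helper in the spirit of G2.0-G2.1, which would catch both the max_iter = 20.5 and vector-valued cases noted above (names and messages are illustrative, not fastadi's internals):

```r
assert_count <- function (x, name) {
    if (length (x) != 1L || !is.numeric (x) ||
        is.na (x) || x != round (x) || x < 1) {
        stop ("`", name, "` must be a single integer >= 1", call. = FALSE)
    }
    as.integer (x)
}
assert_count (20, "max_iter")            # 20L
# assert_count (20.5, "max_iter")        # error
# assert_count (c (20, 25), "max_iter")  # error
```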

Tabular Input

  • [NA] G2.6 Software should accept as input as many of the above standard tabular forms as possible, including extension to domain-specific forms.
  • G2.7 Software should provide appropriate conversion or dispatch routines as part of initial pre-processing to ensure that all other sub-functions of a package receive inputs of a single defined class or type.
  • G2.8 Software should issue diagnostic messages for type conversion in which information is lost (such as conversion of variables from factor to character; standardisation of variable names; or removal of meta-data such as those associated with sf-format data) or added (such as insertion of variable or column names where none were provided).
  • G2.9 Software should ensure that extraction or filtering of single columns from tabular inputs should not presume any particular default behaviour, and should ensure all column-extraction operations behave consistently regardless of the class of tabular data used as input.

Software only accepts sparse matrices.

Missing or Undefined Values

  • G2.10 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.
  • G2.11 Where possible, all functions should provide options for users to specify how to handle missing (NA) data, with options minimally including:
  • G2.11a error on missing data
  • G2.11b ignore missing data with default warnings or messages issued
  • G2.11c replace missing data with appropriately imputed values
  • G2.12 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or cor()).
  • G2.13 All functions should also provide options to handle undefined values (e.g., NaN, Inf and -Inf), including potentially ignoring or removing such values.

Inf or NA values give TridiagEigen: eigen decomposition failed. This is an imputation method, but I don't understand enough about the area to know how missing values are filled; I guess missing values are coded as structural zeros?
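If that guess is right, the relevant distinction is the one the Matrix package draws between structural zeros and NA:

```r
library (Matrix)
# only two entries are observed; the rest are structural zeros:
m <- sparseMatrix (i = c (1, 2), j = c (1, 3), x = c (5, 7), dims = c (3, 3))
m [1, 2]  # 0: an unobserved entry is a structural zero, not NA
```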

Output Structures

  • [NA] G3.0 Statistical Software which enables outputs to be written to local files should parse parameters specifying file names to ensure appropriate file suffices are automatically generated where not provided.

Testing

Test Data Sets

  • [NA] G4.0 Where applicable or practicable, tests should use standard data sets with known properties (for example, the NIST Standard Reference Datasets, or data sets provided by other widely-used R packages).
  • [ x ] G4.1 Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.

Responses to Unexpected Input

  • G4.2 Appropriate error and warning behaviour of all functions should be explicitly demonstrated through tests. In particular,
  • G4.2a Every message produced within R code by stop(), warning(), message(), or equivalent should be unique
    multiple "This should not happen" and "max_iter must be an integer >= 1L" errors, and "Reached maximum allowed iterations" warnings?
  • G4.2b Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.
  • G4.3 For functions which are expected to return objects containing no missing (NA) or undefined (NaN, Inf) values, the absence of any such values in return objects should be explicitly tested.
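
A G4.2b-style test would then assert on the exact message text, for example using the "max_iter must be an integer >= 1L" message quoted above (the input object mf here is a placeholder):

```r
library (testthat)
test_that ("invalid max_iter triggers the documented error", {
    expect_error (
        adaptive_impute (mf, rank = 3L, max_iter = 0L),
        "max_iter must be an integer >= 1L",
        fixed = TRUE
    )
})
```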

Algorithm Tests

  • G4.4 Correctness tests to test that statistical algorithms produce expected results to some fixed test data sets (potentially through comparisons using binding frameworks such as RStata).
  • G4.4a For new methods, it can be difficult to separate out correctness of the method from the correctness of the implementation, as there may not be reference for comparison. In this case, testing may be implemented against simple, trivial cases or against multiple implementations such as an initial R implementation compared with results from a C/C++ implementation.
  • G4.4b For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed-versions of other software, or use stored outputs from those where that is not possible.
  • G4.4c Where applicable, stored values may be drawn from published paper outputs when applicable and where code from original implementations is not available
  • G4.5 Correctness tests should be run with a fixed random seed
  • G4.6 Parameter recovery tests to test that the implementation produce expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.
  • G4.6a Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.
  • G4.6b Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component. (When long-running, such tests may be part of an extended, rather than regular, test suite; see G4.10-4.12, below).
  • G4.7 Algorithm performance tests to test that implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.
  • G4.8 Edge condition tests to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties including but not limited to:
  • G4.8a Zero-length data
  • G4.8b Data of unsupported types (e.g., character or complex numbers for functions designed only for numeric data)
  • G4.8c Data with all-NA fields or columns or all identical fields or columns
  • G4.8d Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)
  • G4.9 Noise susceptibility tests Packages should test for expected stochastic behaviour, such as through the following conditions:
  • G4.9a Adding trivial noise (for example, at the scale of .Machine$double.eps) to data does not meaningfully change results
  • G4.9b Running under different random seeds or initial conditions does not meaningfully change results
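
A minimal sketch of a G4.5/G4.6-style parameter-recovery test for a low-rank setting, in base R only: simulate an exactly rank-3 matrix under a fixed seed, and check that a rank-3 reconstruction recovers it to tolerance:

```r
set.seed (42)
U <- matrix (rnorm (50 * 3), nrow = 50)
V <- matrix (rnorm (40 * 3), nrow = 40)
M <- U %*% t (V)  # exactly rank 3 by construction
s <- svd (M, nu = 3, nv = 3)
M_hat <- s$u %*% diag (s$d [1:3]) %*% t (s$v)
stopifnot (max (abs (M - M_hat)) < 1e-8)
```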

Extended tests

  • [NA] G4.10 Extended tests should be included and run under a common framework with other tests, but switched on by flags such as a <MYPKG>_EXTENDED_TESTS=1 environment variable.
  • G4.11 Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.
  • G4.11a When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, rather be skipped and implicitly succeed with an appropriate diagnostic message.
  • G4.12 Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a CONTRIBUTING.md or tests/README.md file.

Dimensionality Reduction, Clustering, and Unsupervised Learning Standards

Input Data Structures and Validation

  • UL1.0 Unsupervised Learning Software should explicitly document expected format (types or classes) for input data, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in vector or matrix form, or that all inputs must be in data.frame form with both column and row names.
  • UL1.1 Unsupervised Learning Software should provide distinct sub-routines to assert that all input data is of the expected form, and issue informative error messages when incompatible data are submitted.
    This is achieved through S3 methods.
  • [NA] UL1.2 Unsupervised learning which uses row or column names to label output objects should assert that input data have non-default row or column names, and issue an informative message when these are not provided. (Such messages need not necessarily be provided by default, but should at least be optionally available.)
  • UL1.3 Unsupervised Learning Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other attributes(), to corresponding aspects of return objects.
  • [NA] UL1.3a Where otherwise relevant information is not transferred, this should be explicitly documented.
  • UL1.4 Unsupervised Learning Software should explicitly document whether input data may include missing values.
  • UL1.5 Functions in Unsupervised Learning Software which do not admit input data with missing values should provide informative error messages when data with missing values are submitted.
    TridiagEigen error
  • [NA] UL1.6 Unsupervised Learning Software should document any assumptions made with regard to input data; for example assumptions about distributional forms or locations (such as that data are centred or on approximately equivalent distributional scales). Implications of violations of these assumptions should be both documented and tested, in particular:
  • [NA] UL1.6a Software which responds qualitatively differently to input data which has components on markedly different scales should explicitly document such differences, and implications of submitting such data.
  • UL1.6b Examples or other documentation should not use scale() or equivalent transformations without explaining why scale is applied, and explicitly illustrating and contrasting the consequences of not applying such transformations.
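
The UL1.1-style validation noted above could be sketched as follows. This is illustrative only (the function name and exact messages are assumptions, not fastadi's actual checks):

```r
# Hypothetical input validator: accept sparse Matrix input, reject
# everything else (and missing values) with informative error messages.
validate_input <- function(X) {
  if (!methods::is(X, "sparseMatrix")) {
    stop(
      "`X` must be a sparse Matrix object, not a <", class(X)[1], ">.",
      call. = FALSE
    )
  }
  if (anyNA(X)) {
    # UL1.5: fail informatively here rather than with a downstream
    # "TridiagEigen: eigen decomposition failed" error
    stop("`X` must not contain missing values.", call. = FALSE)
  }
  invisible(X)
}
```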

Pre-processing and Variable Transformation

  • [?] UL2.0 Routines likely to give unreliable or irreproducible results in response to violations of assumptions regarding input data (see UL1.6) should implement pre-processing steps to diagnose potential violations, and issue appropriately informative messages, and/or include parameters to enable suitable transformations to be applied (such as the center and scale. parameters of the stats::prcomp() function).
  • [NA] UL2.1 Unsupervised Learning Software should document any transformations applied to input data, for example conversion of label-values to factor, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
  • [NA] UL2.2 For Unsupervised Learning Software which accepts missing values in input data, functions should implement explicit parameters controlling the processing of missing values, ideally distinguishing NA or NaN values from Inf values (for example, through use of na.omit() and related functions from the stats package).
  • [NA] UL2.3 Unsupervised Learning Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear.

Documentation points out that results will be unreliable if the rank chosen for the approximation is approximately equal to, or greater than, the rank of the input. (Testing the rank of a large matrix is expensive and probably impractical for typical input data; I don't know whether it is worth pointing out Matrix::rankMatrix() or not.)
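
For small or moderate inputs, the pre-check described above could be sketched as follows (illustrative only; Matrix::rankMatrix() may itself be too costly for the large sparse matrices this package targets):

```r
library(Matrix)

# Hypothetical UL2.0-style diagnostic: warn when the requested
# approximation rank is at or above the estimated rank of X.
check_rank <- function(X, rank) {
  est <- as.integer(Matrix::rankMatrix(X))
  if (rank >= est) {
    warning(
      "Requested rank (", rank, ") is >= the estimated rank of `X` (",
      est, "); results may be unreliable."
    )
  }
  invisible(est)
}
```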

Algorithms

Labelling

  • [NA] UL3.1 Algorithms which apply sequential labels to input data (such as clustering or partitioning algorithms) should ensure that the sequence follows decreasing group sizes (so labels of "1", "a", or "A" describe the largest group, "2", "b", or "B" the second largest, and so on.)
  • [NA] UL3.2 Dimensionality reduction or equivalent algorithms which label dimensions should ensure that sequences of labels follow decreasing "importance" (for example, eigenvalues or variance contributions).
  • [NA] UL3.3 Unsupervised Learning Software for which input data does not generally include labels (such as array-like data with no row names) should provide an additional parameter to enable cases to be labelled.

Prediction

  • [NA] UL3.4 Where applicable, Unsupervised Learning Software should implement routines to predict the properties (such as numerical ordinates, or cluster memberships) of additional new data without re-running the entire algorithm.

Group Distributions and Associated Statistics

  • [NA] UL3.5 Objects returned from Unsupervised Learning Software which labels, categorises, or partitions data into discrete groups should include, or provide immediate access to, quantitative information on intra-group variances or equivalent, as well as on inter-group relationships where applicable.

Return Results

  • [NA] UL4.0 Unsupervised Learning Software should return some form of "model" object, generally through using or modifying existing class structures for model objects, or creating a new class of model objects.
  • [NA] UL4.1 Unsupervised Learning Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.
  • UL4.2 The return object from Unsupervised Learning Software should include, or otherwise enable immediate extraction of, all parameters used to control the algorithm used.

Reporting Return Results

  • UL4.3 Model objects returned by Unsupervised Learning Software should implement or appropriately extend a default print method which provides an on-screen summary of model (input) parameters and methods used to generate results. The print method may also summarise statistical aspects of the output data or results.
  • UL4.3a The default print method should always ensure only a restricted number of rows of any result matrices or equivalent are printed to the screen.
  • UL4.4 Unsupervised Learning Software should also implement summary methods for model objects which should summarise the primary statistics used in generating the model (such as numbers of observations, parameters of methods applied). The summary method may also provide summary statistics from the resultant model.

Documentation

Visualization

  • [NA] UL6.0 Objects returned by Unsupervised Learning Software should have default plot methods, either through explicit implementation, extension of methods for existing model objects, through ensuring default methods work appropriately, or through explicit reference to helper packages such as factoextra and associated functions.
  • [NA] UL6.1 Where the default plot method is NOT a generic plot method dispatched on the class of return objects (that is, through a plot.<myclass> function), that method dispatch should nevertheless exist in order to explicitly direct users to the appropriate function.
  • [NA] UL6.2 Where default plot methods include labelling components of return objects (such as cluster labels), routines should ensure that labels are automatically placed to ensure readability, and/or that appropriate diagnostic messages are issued where readability is likely to be compromised (for example, through attempting to place too many labels).

Testing

  • UL7.0 Inappropriate types of input data are rejected with expected error messages.

Input Scaling

  • UL7.1 Tests should demonstrate that violations of assumed input properties yield unreliable or invalid outputs, and should clarify how such unreliability or invalidity is manifest through the properties of returned objects.

Output Labelling

  • [NA] UL7.2 Demonstrate that labels placed on output data follow decreasing group sizes (UL3.1)
  • UL7.3 Demonstrate that labels on input data are propagated to, or may be recovered from, output data (see UL3.3).

Prediction

  • [NA] UL7.4 Demonstrate that submission of new data to a previously fitted model can generate results more efficiently than initial model fitting.

Batch Processing

  • [NA] UL7.5 Batch processing routines should be explicitly tested, commonly via extended tests (see G4.10--G4.12).
  • [NA] UL7.5a Tests of batch processing routines should demonstrate that equivalent results are obtained from direct (non-batch) processing.

@bbolker

bbolker commented Dec 1, 2020

  • Documentation is clearly present, but presupposes much more contextual knowledge than I have (e.g. that structural zeros in a sparse matrix are assumed to be missing data?)
  • Documentation is fairly good at presenting limitations of the approach (e.g. rank requirements)
  • Package is algorithmic, which simplifies much of the type-checking. I don't know whether any subclasses in the sparseMatrix zoo of classes will cause problems; presumably these will throw TridiagEigen: eigen decomposition failed errors ...
  • the 'lifecycle' badge on GitHub says 'maturing', but some aspects seem alpha-like to me, for example the relatively verbose (and unsilenceable?) output
  • tests have some warnings (3 warn/38 pass)
  • there are lots of loose ends in the documentation; in particular, there are no vignettes, and (as stated above) a high level of background knowledge is assumed. This may be OK, depending on the audience? Most of the contextual information beyond the basic function documentation is in the DESCRIPTION (specifically in the "Description:" field), accessible via ?fastadi, the README file (not included with the package), or the vignettes (which are not installed: "the vignettes are currently scratch work for reference by the developers and are not yet ready for general consumption.").
  • As far as I can tell, based on the style, the package seems technically sound - it's basically an implementation of a reasonably straightforward (?) algorithm - although the vignettes suggest that the developers are still considering a lot of technical issues. I don't know whether (1) the vignettes are out of date and represent problems that have already been solved; (2) the vignettes suggest issues that would prevent good use of the package in its current form; or (3) the vignettes are aspirational, i.e. they represent fundamental technical problems that should be solved in order to improve the speed/scalability of the current implementation.
  • the tests seem OK, but I have a hard time judging their completeness due to my lack of technical knowledge/context

MINOR comments

  • installs OK via remotes::install_github()
  • depends on another only-on-Github package (LMRF)
  • go to help.start()/HTML help
    • using L integer notation in help may be confusing?
  • do we really need a default method that returns an error? (I guess.)
  • should adaptive_initialize have a 'quietly' setting?
  • why does example("adaptive_initialize") say 'rank 5' in the comments when it looks like this is a rank-3 approximation?
  • docs for "CitationImpute/citation_impute"
    • does "missing value" mean "implicit zero" here?
    • epsilon should be "criterion", not "criteria" ?
    • additional: precise -> precision
  • adaptive_impute: "Explicit (observed) zeroes in ‘X’ can be dropped for" ??
    • "isn't know[n]"

@alexpghayes

This feedback is very useful, thank you!
