[Discussion] Static or Dynamic analysis, .py (with no output) vs ipynb (that includes output) analysis #3

Open
canyon289 opened this issue Oct 7, 2023 · 2 comments

Comments

@canyon289

Should we "read" the models as code, or actually have an env where they run?

My gut feeling is that for simplicity, feasibility, and security we should do static analysis, or at least start there.

The downsides are

  • Some models are constructed across multiple Python modules, which makes single-file analysis hard, but I think this is a rare case
  • It makes parsing certain things, like the number of variables, hard, especially with dims, coords, etc.
  • It means we can't use PyMC functionality itself for the analysis

The upsides are

  • PyMC models aren't really isolated from their env; for dynamic analysis we have to get the data, load it correctly, etc., and that's a headache
  • No need to worry about versions or code security
  • Makes it simple for anyone to run themselves
  • Computationally faster

Assuming static analysis, this library would then behave like a linter: it gets a reference to a .py or .ipynb file and outputs a set of metrics.
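
A minimal sketch of what that could look like, assuming the conventional `import pymc as pm` alias; the function name and interface here are hypothetical, not a settled API:

```python
# Hypothetical sketch: count pm.<Distribution>(...) calls in a .py file via static analysis.
# Assumes the model file uses the conventional "import pymc as pm" alias.
import ast
from collections import Counter

def count_pm_calls(path: str) -> Counter:
    """Count attribute calls on the `pm` alias, e.g. pm.Normal, pm.sample."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    counts = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            target = node.func.value
            if isinstance(target, ast.Name) and target.id == "pm":
                counts[node.func.attr] += 1
    return counts

# e.g. count_pm_calls("model.py") -> Counter({"Normal": 3, "HalfNormal": 1, "sample": 1})
```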

@canyon289

In another thread, @twiecki added the idea of analyzing divergences and sampling time. We can't really do that without sampling or dynamic analysis, but if someone submits a notebook they've already run, we could parse that out of the cell output. Consider this to be "notebook" analysis, a third possibility

https://discourse.pymc.io/t/extended-event-gathering-pymc-usage-information/13064/3?u=ravinkumar
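
A rough sketch of how that notebook parsing could work; the exact divergence warning text varies across PyMC versions, so the regex below is an assumption rather than a guaranteed match:

```python
# Hypothetical sketch: scrape divergence counts from the cell outputs of a .ipynb file.
# The exact warning wording differs between PyMC versions; the regex is illustrative.
import json
import re

DIVERGENCE_RE = re.compile(r"(\d+)\s+divergences?")

def divergences_from_notebook(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    total = 0
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            # "stream" outputs carry the stdout/stderr text where sampler warnings end up
            text = "".join(output.get("text", []))
            for match in DIVERGENCE_RE.finditer(text):
                total += int(match.group(1))
    return total
```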

canyon289 changed the title from "[Discussion] Static or Dynamic analysis" to "[Discussion] Static or Dynamic analysis, .py (with no output) vs ipynb (that includes output) analysis" on Oct 7, 2023
@OriolAbril

OriolAbril commented Oct 14, 2023

Here are all the things we gathered on discourse. I have tried to split them into the 3 different analysis areas: static code analysis (requires the file where the model is defined as input), env analysis (requires being executed in the same env where the model runs) and infdata analysis (requires the idata as input directly, or its filename plus ArviZ installed). There could also potentially be a 4th area for more "demographic" info, which would need to be a form filled in by the user; skipping that for now.

Static

  • Which distributions are being used and how often?
  • Which sampling functions are more common? Which defaults are most often modified?
    • Also partially possible via idata analysis checking which groups are present
  • What operations are more common with PyMC’s outputs: plotting with ArviZ, saving to disk, converting to NumPy/Pandas objects…
  • What are the most common packages also imported? How many folks use xarray operations, arviz, scipy, etc.
  • Do people use the default sampler, specify their own, or change sampler arguments
  • What are the most common prior parameters
  • Which backend is being used?
  • Use of coords/named dims in models (also combined with InfData section)
  • Use of “basic” PyMC vs specialized sub-modules/associated projects: GP, BART, sunode, Bambi, (others?)
    • Also partially done via env analysis
  • PyMC vs PyMC3 from static import analysis (see the sketch after this list)
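
A possible way to get the import-based items above, again as a hedged sketch (the function name is hypothetical):

```python
# Hypothetical sketch: distinguish pymc vs pymc3 (and other imported packages) statically.
import ast

def imported_top_level_modules(source: str) -> set:
    """Return the top-level module names imported in the given source code."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

# {"pymc", "arviz", "xarray"} vs {"pymc3", ...} answers the PyMC vs PyMC3 question,
# and the rest of the set covers which other packages are also imported.
```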

Env

  • How many users are using pymc3 vs pymc
    • Partially doable from static section too, if restricted to pymc3 vs pymc version comparison
  • Versions of related packages: how common is it to have the latest pymc but an older arviz, numpy... (a sketch follows below)
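
For the env items, something like the following could run inside the user's environment; the package list is just an example:

```python
# Hypothetical sketch: record installed versions of PyMC-related packages in the current env.
from importlib.metadata import PackageNotFoundError, version

def related_package_versions(packages=("pymc", "pymc3", "arviz", "numpy", "xarray")):
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None  # package not installed in this env
    return versions
```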

InfData

  • How big are the models? Number of variables being sampled by MCMC? Number of observations? How close are we to models that don’t fit in the RAM of common computers?
    • Partially possible as static analysis too, but quite difficult imo, and again, only partially.
  • Number of divergences? (see the idata sketch after this list)
  • Total sampling time?
  • ESS
  • Size of datasets being analyzed
  • Use of coords/named dims in models (also combined with static section)
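
A hedged sketch of pulling some of these from an InferenceData object with ArviZ; it assumes a posterior group is present and that sampling_time was stored in the sample_stats attrs (PyMC adds it, but it may be missing):

```python
# Hypothetical sketch: extract a few metrics from an InferenceData object.
import arviz as az

def idata_metrics(idata: az.InferenceData) -> dict:
    metrics = {}
    posterior = idata.posterior  # assumes the posterior group exists
    metrics["n_variables"] = len(posterior.data_vars)
    metrics["posterior_sizes"] = dict(posterior.sizes)  # chains, draws, named dims
    if "sample_stats" in idata.groups():
        stats = idata.sample_stats
        if "diverging" in stats:
            metrics["n_divergences"] = int(stats["diverging"].sum())
        metrics["sampling_time"] = stats.attrs.get("sampling_time")  # may be None
    # minimum bulk ESS across all posterior variables
    metrics["min_ess_bulk"] = float(az.ess(idata).to_array().min())
    return metrics

# usage: idata_metrics(az.from_netcdf("trace.nc"))
```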

Demographics

  • Scientific domains/industries where PyMC is being used
  • Types of data being studied (purely cross-sectional, purely time series, longitudinal, Geo-spatial…)
  • Causal identification strategies (if any/applicable)
  • Repos associated with published papers?
