Add parametric UMAP model family (#688)
* Add src/vak/prep/unit_dataset/ with unit_dataset.py
* Add vak/prep/dimensionality_reduction/ with prep_dimensionality_reduction_dataset function
* Import new modules in vak/prep/__init__.py
* Remove parameter from prep_frame_classification_dataset docstring; it is not a parameter of this function
* Rename 'vak.prep.split.dataframe' -> 'vak.prep.split.frame_classification_dataframe', and add function 'vak.prep.split.unit_dataframe'
* Use renamed 'split.frame_classification_dataframe' in vak.prep.frame_classification.prep_frame_classification_dataset
* Fix typo in src/vak/prep/split/split.py
* Remove wrong type hint in src/vak/prep/unit_dataset/unit_dataset.py
* Add vak/prep/dimensionality_reduction/dataset_arrays.py with function 'move_files_into_split_subdirs'
* Add src/vak/datasets/dimensionality_reduction/ with unit_dataset.py and metadata.py
* Add initial UnitDataset class, fix imports in datasets/dimensionality_reduction/__init__.py
* Import dataset_arrays module in src/vak/prep/dimensionality_reduction/__init__.py
* Import dimensionality_reduction in vak/datasets/__init__.py
* Fix typo in src/vak/datasets/dimensionality_reduction/__init__.py
* Fix `vak.prep.dimensionality_reduction.prep_dimensionality_reduction_dataset` to use `dataset_arrays.move_array_files_into_split_dirs` and `Metadata`
* Remove wrong import from src/vak/datasets/dimensionality_reduction/unit_dataset.py
* Fix pad_spectrogram to re-save file after padding
* Add vak/nn/loss/umap.py
* Add src/vak/datasets/dimensionality_reduction/parametric_umap/
* Add vak/nets/conv_encoder.py
* Fix dataset class in vak/datasets/dimensionality_reduction/parametric_umap/parametric_umap.py
* Import ParametricUMAPDataset in vak.datasets
* Import umap_loss in src/vak/nn/loss/__init__.py
* Add src/vak/models/parametric_umap_model.py
* Add src/vak/models/convencoder_parametric_umap.py
* Import ParametricUMAPModel and ConvEncoderParametricUMAP in src/vak/models/__init__.py
* Import conv_encoder and ConvEncoder in vak/nets/__init__.py
* Add shape property to ParametricUMAPDataset
* Add UmapLoss class to vak/nn/loss/umap.py
* Import functional, and import * from .loss and .modules, in vak/nn/__init__.py
* Fix how we get decoder from network dict in ParametricUMAPModel: use get, default to None
* Fix ConvEncoderParametricUMAP to only specify 'encoder' in network dict, and to use UmapLoss class for loss
* Add umap-learn and pynndescent as dependencies
* Remove outdated parameter from docstring in train/frame_classification.py
* Add vak/train/parametric_umap.py
* Fix vak/train/train.py to call train_parametric_umap_model when model family is ParametricUMAPModel
* Add 'dimensionality reduction' to DATASET_TYPE_FUNCTION_MAP in vak.prep.constants
* Fix vak/prep/prep.py so it calls prep_dimensionality_reduction_dataset appropriately
* Fix parameter of parametric UMAP dataset: 'Euclidean' -> 'euclidean'
* Add duration property to ParametricUMAPDataset
* Rename vak/transforms/defaults.py -> frame_classification.py, add 'get_default_frame_classification_transforms' helper function
* Import registry in models/__init__.py, add it to __all__ there
* Add transforms/defaults/parametric_umap.py
* Add valid transform_kwargs key-value pairs to docstring of get_default_frame_classification_transforms
* Add transforms/defaults/get.py
* Rename transforms.defaults.get.get_defaults -> get_default_transform
* Add transforms/defaults/__init__.py with imports
* Fix train/frame_classification.py to use transforms.defaults.get_default_transform
* Fix eval/frame_classification.py to use transforms.defaults.get_default_transform
* Fix predict/frame_classification.py to use transforms.defaults.get_default_transform
* Fix imports in transforms/__init__.py: import defaults, not get_defaults, from defaults
* Fixup src/vak/eval/frame_classification.py
* Fixup src/vak/predict/frame_classification.py
* Fixup src/vak/train/frame_classification.py
* Fix 'get_default_frame_classification_transform' to access transform_kwargs with get and have a default
* Fix function name in 'make_learncurve_splits_from_dataset_df': 'split.dataframe' -> 'split.frame_classification_dataframe'
* Remove argument 'spect_key' in cli/predict that was removed from vak.predict.predict
* Refactor/fix the script that generates the 'generated' test data, so that we are not always running the prep step (which takes a really long time), and so that we actually run eval/predict/train_continue configs:
  - Add a CLI with argparse
  - Add an option to run either the prep step, the results step, or all
  - Add an option to specify commands to run after prep
  - Add an option to either require only one results directory per train config or instead just use the most recent one
* Fix name in predict/frame_classification.py: -> 'datasets.frame_classification.metadata.Metadata'
* Fix frames dataset so frames_labels_paths is None when split is 'predict'
* Use dict get method with transform_kwargs for PredictItemTransform in transforms/defaults/frame_classification.py
* Fix how we add spect_format to metadata in prep/frame_classification/frame_classification.py
* Fix how we determine source_paths for input_type 'spect' in FramesDataset
* Fix arg name in predict/frame_classification.py
* Add print statements to tests/scripts/generate_data_for_tests.py so we can troubleshoot more easily
* Modify generate_data_for_tests script so it only preps datasets once
* Rename configs in test data so model names are capitalized
* Update tests/data_for_tests/configs/configs.json:
  - Capitalize model names
  - Add field "use_dataset_from_config" that points to another config whose dataset path should be used; this avoids doing a bunch of filtering logic inside the script that generates test data, in favor of a declarative approach
* Add package tests/scripts/vaktestdata, refactoring the giant script
* In configs used for test data, capitalize model names in directory paths
* Rename top-level field -> 'config_metadata' in metadata.json, add missing field 'use_dataset_from_config' for one entry
* Add ConfigMetadata dataclass in vaktestdata/config_metadata.py
* Add log message to tests/scripts/vaktestdata/dirs.py
* Modify vaktestdata.configs.copy_config_files to make GENERATED_TEST_CONFIGS_ROOT and to print a log message
* Modify constants so it has a list of ConfigMetadata instances made from configs.json
* Modify vaktestdata.prep.run_prep to use CONFIG_METADATA so it only runs prep for configs that have a null 'use_dataset_from_config' field
* Fix formatting errors in tests/data_for_tests/configs/configs.json
* Add missing field/attribute 'model' to ConfigMetadata
* Rewrite `vaktestdata.configs.add_dataset_path_from_prepped_configs` to use CONFIG_METADATA
* Use logger in tests/scripts/vaktestdata/prep.py
* Use logger in tests/scripts/vaktestdata/dirs.py
* Use logger, get name of config section correctly, and import/use pathlib where needed in tests/scripts/vaktestdata/configs.py
* Add default for parser arg '--commands' in tests/scripts/vaktestdata/parser.py
* Rewrite tests/scripts/generate_data_for_tests.py to use vaktestdata package
* Rename ConvEncoderParametricUMAP -> ConvEncoderUMAP (the fact that it's parametric is implied)
* Add tests/data_for_tests/configs/ConvEncoderUMAP_train_audio_cbin_annot_notmat.toml
* Add ConvEncoderUMAP_train_audio_cbin_annot_notmat.toml to configs.json
* Add args + attributes to UmapLoss class
* Revise vak.models.parametric_umap_model:
  - Remove loss function kwargs from model __init__ now that they are args/attributes of the loss class itself; they will be passed in with the config
  - Rewrite train/val steps of the model to use the rewritten loss class
  - Add ParametricUMAP class with fit/transform methods, mirroring the UMAP-learn API; the plan is to just use this class inside `vak.train.parametric_umap`, `vak.eval.parametric_umap`, etc.
  - Also add a DataModule like in Tim's code; not sure yet if it is needed
* Move parametric UMAP dataset up into vak.datasets, get rid of dimensionality_reduction sub-package
* Make minor fixes to ParametricUMAP class
* WIP: Fix vak/train/parametric_umap.py so it actually works
* Add 'train_dataset_params' and 'val_dataset_params' as attributes to TrainConfig, and add them in valid.toml
* Add 'train_transform_params' and 'val_transform_params' as attributes to TrainConfig, and add them in valid.toml
* Remove `root_results_path` arg from vak.train.train, no longer used
* Add args for train/val transform params and train/val dataset params to vak.train.train, pass them into vak.train.train_parametric_umap_model
* Pass args for train/val transform params and train/val dataset params into vak.train.train from vak.cli.train.train
* Add and use args train_transform_params and val_transform_params in vak.train.train_parametric_umap_model
* Put args with defaults in the correct place in vak.train.train
* Fix TrainConfig attribute names: train/val transform kwargs -> transform params
* Change type annotation for ParametricUMAPModel parameter network to indicate it must be a dict
* Fix where we get Metadata from in ParametricUMAPDataset.from_dataset_path method
* Fix how we get default kwargs for a network definition that's a dict inside Model class
* Fix how we get ParametricUMAPModel in vak.models.get
* Remove ckpt_step and patience args from call to train_parametric_umap_model in vak/train/train.py
* Make more fixes to train_parametric_umap_model:
  - Add a get_trainer function instead of re-using the one for frame classification models
  - Fix logging messages about the length of the dataset
* Rewrite train_frame_classification_model to use train/val_transform_params and train/val_dataset_params args
* Fix train_parametric_umap_model to use train/val_transform_params
* In vak.train.train, pass train/val_transform/dataset_params into train_frame_classification_model
* Add train_dataset_params and val_transform_params to frame classification train configs in tests/data_for_tests/configs
* Add train/val_transform_params to tests/data_for_tests/configs/ConvEncoderUMAP_train_audio_cbin_annot_notmat.toml
* Add train/val_transform_params and train/val_dataset_params to vak.train.train docstring
* fixup: Add train/val_transform_params and train/val_dataset_params to vak.train.train docstring
* Fix definition in train_frame_classification_model docstring
* Add transform/dataset_params options to EVAL and PREDICT sections in valid.toml
* Add transform/dataset_params to EvalConfig
* Add transform/dataset_params to PredictConfig
* Add/use transform/dataset_params in eval_frame_classification_model function
* Add/use transform/dataset_params in vak.eval.eval function -- pass them into eval_frame_classification_model
* Add/use transform/dataset_params in predict_frame_classification_model function
* Add/use transform/dataset_params in vak.predict.predict function -- pass them into predict_frame_classification_model
* Fix definition in vak/train/train.py docstring
* Remove src/vak/config/dataloader.py
* Remove use of Dataloader in vak/config
* Add transform_params options and remove DATALOADER sections in eval/predict configs
* Add train/val_transform_params and train/val_dataset_params to vak.learncurve.frame_classification
* Add train/val_transform_params and train/val_dataset_params to vak.learncurve.learncurve
* Remove import of dataloader in config/config.py
* Finish removing dataloader imports from config sub-package
* Filter out NumbaDeprecationWarnings triggered by umap
* Remove DATALOADER section from two learncurve configs
* Remove DATALOADER section from 4 other configs in data_for_tests/configs
* Add name 'use_result_from_config' in configs.json
* Add attribute `use_result_from_config` to ConfigMetadata
* Remove constants from tests/scripts/vaktestdata/constants.py that are no longer used
* Rewrite `vaktestdata.configs.fix_options_in_configs` to use declarative config metadata when determining which config to use results from
* Refactor main loop in tests/scripts/generate_data_for_tests.py
* Remove other constants from tests/scripts/vaktestdata/constants.py
* Fix how dirs get made in tests/scripts/vaktestdata/dirs.py
* Change birdsongrec -> birdsong-recognition-dataset in dir names in test configs
* Rename prep/dimensionality_reduction -> prep/parametric_umap
* Add missing train/val_dataset/transform_params options to LEARNCURVE table in valid.toml
* Add missing train/val_dataset/transform_params options in call to learncurve in cli.learncurve
* Remove window size, add transform/dataset_params in call to eval in cli.eval
* Remove window size, add transform/dataset_params in call to predict in cli.predict
* Remove window_size arg in call to eval_frame_classification_model
* fixup: Remove window size, add transform/dataset_params in call to predict in cli.predict
* Add missing fields to some entries in tests/data_for_tests/configs/configs.json
* Fix how we handle 'train_continue' command in generate_data_for_tests.py
* Make minor fixes to docstring of eval_frame_classification_model
* Remove 'batch_size' from eval_frame_classification_model docstring; it is not a parameter of this function
* Remove unused import from train_parametric_umap_model
* Add vak/eval/parametric_umap.py
* Modify vak/eval/eval.py to call eval_parametric_umap_model when appropriate
* Remove 'labelmap_path' from REQUIRED_OPTIONS in vak/config/parse.py, since it's not required for parametric UMAP models
* Fix how we train parametric UMAP models so checkpoints get saved, and in the correct location
* Pass 'ckpt_step' into train_parametric_umap_model inside vak.train.train
* Add tests/data_for_tests/configs/ConvEncoderUMAP_eval_audio_cbin_annot_notmat.toml
* Add ConvEncoderUMAP_eval_audio_cbin_annot_notmat.toml to tests/data_for_tests/configs/configs.json
* Make labelmap_path optional for EvalConfig, so parametric UMAP models don't crash
* Rewrite definition of batch_size in docstring of src/vak/eval/parametric_umap.py
* Add batch_size parameter to vak.eval.eval, make labelmap_path parameter default to None, fix parameter order in docstring
* Pass batch size from config into vak.eval.eval inside vak.cli.eval
* Fix prep section of tests/data_for_tests/configs/ConvEncoderUMAP_eval_audio_cbin_annot_notmat.toml so it makes a test split
* Fix resize option in ConvEncoderUMAP configs so that unit images are square
* Add shape attribute to parametric_umap.Metadata
* Revise src/vak/prep/unit_dataset/unit_dataset.py for readability, and make it return the shape of all spectrograms
* Get shape returned by prep_unit_dataset inside src/vak/prep/parametric_umap/parametric_umap.py and use it with Metadata
* Fix parametric_umap.Metadata -- the shape attribute is mandatory, so it needs to come before audio_format
* Fix prep_unit_dataset so we actually get the shape of spectrograms
* Add converter to parametric_umap.Metadata.shape attribute to cast list to tuple when we load from JSON
* Add functions for default padding to src/vak/models/convencoder_umap.py, to use in train.parametric_umap, eval.parametric_umap, etc.
* Modify default parametric_umap transform so that it only adds the padding transform if 'padding' is in the transform_kwargs
* Modify train/parametric_umap to use default padding for the ConvEncoderUMAP model
* Rewrite default padding for convencoder_umap to round to the nearest tens place
* Move the code block that gets default padding for ConvEncoderUMAP so it's in the right place, before we get the transforms that use it
* Make fixes in eval/parametric_umap -- get default padding for ConvEncoderUMAP, remove spect_scaler_path since it's not used
* Stop passing parameter 'spect_scaler_path' into 'vak.eval.eval_parametric_umap_model' inside vak.cli.eval
* WIP: Add missing docstrings in src/vak/datasets/frame_classification/frames_dataset.py
* WIP: Add missing docstrings in src/vak/datasets/frame_classification/window_dataset.py
* Revise docstrings in src/vak/datasets/parametric_umap/parametric_umap.py, rename ParametricUMAPDataset -> ParametricUMAPTrainingDataset, and add ParametricUMAPInferenceDataset
* WIP: Add src/vak/predict/parametric_umap.py
* Rename ParametricUMAPTrainingDataset -> ParametricUMAPDataset
* Remove transform_params table from ConvEncoderUMAP configs
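Two of the fixes above concern parametric_umap.Metadata: the shape attribute is mandatory (so it must precede attributes with defaults), and it needs a converter that casts the JSON-loaded list back to a tuple. A minimal standalone sketch of that pattern, using a plain dataclass so it is self-contained; the names and fields are simplified stand-ins, not vak's actual class:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for vak's parametric_umap.Metadata.

    ``shape`` is mandatory, so it is declared before attributes
    that have defaults.
    """
    shape: tuple
    audio_format: Optional[str] = None

    def __post_init__(self):
        # JSON has no tuple type, so ``shape`` round-trips as a list;
        # cast it back to a tuple when loading.
        self.shape = tuple(self.shape)

    @classmethod
    def from_json(cls, json_str: str) -> "Metadata":
        return cls(**json.loads(json_str))
```

With this converter, `Metadata.from_json('{"shape": [128, 128]}').shape` comes back as the tuple `(128, 128)` rather than a list.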
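The default-padding behavior described above ("round to the nearest tens place") could look roughly like the following sketch. The function name and the way padding is split between the two sides are assumptions for illustration, not vak's actual implementation:

```python
import math


def default_padding(shape):
    """Compute per-dimension padding that grows each spectrogram
    dimension up to the next multiple of 10.

    Hypothetical helper sketching the 'default padding' idea from
    the commit log. Returns a tuple of (before, after) padding
    amounts per dimension, split as evenly as possible.
    """
    pad = []
    for dim in shape:
        target = math.ceil(dim / 10) * 10  # round up to nearest tens place
        total = target - dim
        pad.append((total // 2, total - total // 2))
    return tuple(pad)
```

For example, `default_padding((117, 32))` returns `((1, 2), (4, 4))`, padding the image to 120 x 40 so batches of unit spectrograms share consistent, evenly divisible dimensions.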