Merge branch 'experiment/fix_codespell' into 'master'
fix codespell

See merge request ai-lab-pmo/mltools/automl/LightAutoML!18
dev-rinchin committed Oct 24, 2024
2 parents 2591a23 + a88ed9a commit a12f81e
Showing 66 changed files with 19,549 additions and 19,549 deletions.
2 changes: 1 addition & 1 deletion docs/pages/Installation.rst
@@ -27,7 +27,7 @@ Then,
# Create virtual environment inside your project directory
poetry config virtualenvs.in-project true
-# If you want to update dependecies, run the command:
+# If you want to update dependencies, run the command:
poetry lock
# Installation
38,318 changes: 19,159 additions & 19,159 deletions examples/data/jobs_train.csv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion examples/optimization/sequential_parameter_search.py
@@ -1,6 +1,6 @@
# -*- encoding: utf-8 -*-

"""Simple example for sequetial parameter search with OptunaTuner."""
"""Simple example for sequential parameter search with OptunaTuner."""

import copy


Large diffs are not rendered by default.

94 changes: 47 additions & 47 deletions examples/tutorials/Tutorial_11_time_series.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions examples/tutorials/Tutorial_12_Matching.ipynb
@@ -1425,7 +1425,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. No replacemet matching:\n",
"## 4. No replacement matching:\n",
"\n",
" In order you need to just match groups with no replacement you may use additional method match_no_rep()."
]
@@ -1458,7 +1458,7 @@
"outputs": [],
"source": [
"# you may specify threshold in order to receive only pair with 5% difference in post_spends\n",
"no_replacemet_df = model.match_no_rep(threshold=0.05) "
"no_replacement_df = model.match_no_rep(threshold=0.05) "
]
},
{
@@ -1579,7 +1579,7 @@
}
],
"source": [
"no_replacemet_df.head()"
"no_replacement_df.head()"
]
},
{
@@ -1599,7 +1599,7 @@
}
],
"source": [
"no_replacemet_df.shape"
"no_replacement_df.shape"
]
},
{
4 changes: 2 additions & 2 deletions examples/tutorials/Tutorial_13_ABtesting.ipynb
@@ -1106,7 +1106,7 @@
"metadata": {},
"source": [
"To perform experiment that separates samples by groups `group_col` <br>\n",
"and if there is group that shold not be separated `quant_field` can be used"
"and if there is group that should not be separated `quant_field` can be used"
]
},
{
@@ -1733,7 +1733,7 @@
"### 3.1 Full AB-test\n",
"\n",
"Full (basic) version of test includes calculation of all available metrics, which are: \"diff in means\", \"diff in diff\" and \"cuped\"<br>\n",
"Pay attention, that for \"cuped\" and \"diff in diff\" metrics requred target before pilot."
"Pay attention, that for \"cuped\" and \"diff in diff\" metrics required target before pilot."
]
},
{
8 changes: 4 additions & 4 deletions examples/tutorials/Tutorial_1_basics.ipynb
@@ -842,7 +842,7 @@
"source": [
"Note: only logloss loss is available for binary task and it is the default loss. Default metric for binary classification is ROC-AUC. See more info about available and default losses and metrics [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task). \n",
"\n",
"**Depending on the task, you can and shold choose exactly those metrics and losses that you want and need to optimize.**"
"**Depending on the task, you can and should choose exactly those metrics and losses that you want and need to optimize.**"
]
},
{
@@ -994,7 +994,7 @@
"\n",
"**Role and types guessing**\n",
"\n",
"Roles can be specified as a string or a specific class object, or defined automatically. For ```TabularAutoML``` preset ```'numeric'```, ```'datetime'``` and ```'category'``` roles can be automatically defined. There are two ways of role defining. **First** is very simple: check if the value can be converted to a date (```'datetime'```), otherwise check if it can be converted to a number (```'numeric'```), otherwise declare it a category (```'categorical'```). But this method may not work well on large data or when encoding categories with integers. The **second** method is based on statistics: the distributions of numerical features are considered, and how similar they are to the distributions of real or categorical value. Also different ways of feature encoding (as a number or as a category) are compared and based on normalized Gini index it is decided which encoding is better. For this case a set of specific rules is created, and if at least one of them is fullfilled, then the feature will be assigned to numerical, otherwise to categorical. This check can be enabled or disabled using the ```advanced_roles``` parameter. \n",
"Roles can be specified as a string or a specific class object, or defined automatically. For ```TabularAutoML``` preset ```'numeric'```, ```'datetime'``` and ```'category'``` roles can be automatically defined. There are two ways of role defining. **First** is very simple: check if the value can be converted to a date (```'datetime'```), otherwise check if it can be converted to a number (```'numeric'```), otherwise declare it a category (```'categorical'```). But this method may not work well on large data or when encoding categories with integers. The **second** method is based on statistics: the distributions of numerical features are considered, and how similar they are to the distributions of real or categorical value. Also different ways of feature encoding (as a number or as a category) are compared and based on normalized Gini index it is decided which encoding is better. For this case a set of specific rules is created, and if at least one of them is fulfilled, then the feature will be assigned to numerical, otherwise to categorical. This check can be enabled or disabled using the ```advanced_roles``` parameter. \n",
"\n",
"If roles are explicitly specified, automatic definition won't be applied to the specified dataset columns. In the case of specifying a role as an object of a certain class, through its arguments, it is possible to set the processing parameters in more detail.\n",
" \n",
@@ -1571,7 +1571,7 @@
"- Fast (`fast`) - this method uses feature importances from feature selector LGBM model inside LightAutoML. It works extremely fast and almost always (almost because of situations, when feature selection is turned off or selector was removed from the final models with all GBM models). There is no need to use new labelled data.\n",
"- Accurate (`accurate`) - this method calculate *features permutation importances* for the whole LightAutoML model based on the **new labelled data**. It always works but can take a lot of time to finish (depending on the model structure, new labelled dataset size etc.).\n",
"\n",
"In the cell below we will use `automl_rd.model` instead `automl_rd` because we want to take the importances from the model, not from the report. But **be carefull** - everything, which is calculated using `automl_rd.model` will not go to the report."
"In the cell below we will use `automl_rd.model` instead `automl_rd` because we want to take the importances from the model, not from the report. But **be careful** - everything, which is calculated using `automl_rd.model` will not go to the report."
]
},
{
@@ -4433,7 +4433,7 @@
"id": "0b1f7773",
"metadata": {},
"source": [
"### Multi-label classifcation\n",
"### Multi-label classification\n",
"\n",
"Now let's consider multi-label classification task, here you will use the same dataset as in section above (Anuran Calls (MFCCs) Data Set ). Let's pick labels in each column:"
]
2 changes: 1 addition & 1 deletion examples/tutorials/Tutorial_2_WhiteBox_AutoWoE.ipynb
@@ -756,7 +756,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### AutoWoE - simplier model"
"### AutoWoE - simpler model"
]
},
{
52 changes: 26 additions & 26 deletions examples/tutorials/Tutorial_4_NLP_Interpretation.ipynb
@@ -12,7 +12,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"../../imgs/LightAutoML_logo_big.png\" alt=\"LightAutoML logo\" style=\"width:100%;\"/>"
"<img src=\"../../docs/imgs/lightautoml_logo_color.png\" alt=\"LightAutoML logo\" style=\"width:100%;\"/>"
]
},
{
@@ -35,7 +35,7 @@
"metadata": {},
"source": [
"The last years deep neural networks / gradient boosting / ensembles of models allow to improve the soulution quality of many application task in field of natural language processing (NLP). The indicators of this improvement describe the partial behavior of the model and can hide errors, for example, errors in the construction of the model, errors in data collection. All this can be critical in tasks related to the processing of medical, forensic, banking data.\n",
"In this tutorial we will check the NLP interpetation module of automl."
"In this tutorial we will check the NLP interpretation module of automl."
]
},
{
@@ -83,7 +83,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dowload data"
"## Download data"
]
},
{
@@ -377,12 +377,12 @@
"[11:22:30] - CPU: 1 cores\n",
"[11:22:30] - memory: 16 GB\n",
"\n",
"[11:22:30] \u001B[1mTrain data shape: (127656, 8)\u001B[0m\n",
"[11:22:30] \u001b[1mTrain data shape: (127656, 8)\u001b[0m\n",
"\n",
"[11:22:30] Layer \u001B[1m1\u001B[0m train process start. Time left 3599.85 secs\n",
"[11:22:31] Start fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m ...\n",
"[11:22:30] Layer \u001b[1m1\u001b[0m train process start. Time left 3599.85 secs\n",
"[11:22:31] Start fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m ...\n",
"[11:22:31] Training params: {'bs': 32, 'num_workers': 1, 'max_length': 128, 'opt_params': {'lr': 1e-05}, 'scheduler_params': {'patience': 5, 'factor': 0.5, 'verbose': True}, 'is_snap': False, 'snap_params': {'k': 1, 'early_stopping': True, 'patience': 1, 'swa': False}, 'init_bias': True, 'n_epochs': 7, 'input_bn': False, 'emb_dropout': 0.1, 'emb_ratio': 3, 'max_emb_size': 50, 'bert_name': 'prajjwal1/bert-tiny', 'pooling': 'cls', 'device': device(type='cuda', index=0), 'use_cont': True, 'use_cat': True, 'use_text': True, 'lang': 'en', 'deterministic': False, 'multigpu': False, 'random_state': 42, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'device_ids': None, 'n_out': 1, 'cat_features': [], 'cat_dims': [], 'cont_features': [], 'cont_dim': 0, 'text_features': ['concated__comment_text'], 'bias': array([[-2.24401446]])}\n",
"[11:22:31] ===== Start working with \u001B[1mfold 0\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
"[11:22:31] ===== Start working with \u001b[1mfold 0\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
"[11:22:36] number of text features: 1 \n",
"[11:22:36] number of categorical features: 0 \n",
"[11:22:36] number of continuous features: 0 \n"
@@ -497,7 +497,7 @@
"output_type": "stream",
"text": [
"[11:45:22] Epoch: 6, train loss: 0.09056100249290466, val loss: 0.10337436944246292, val metric: 0.9788043902058639\n",
"[11:45:23] ===== Start working with \u001B[1mfold 1\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
"[11:45:23] ===== Start working with \u001b[1mfold 1\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
"[11:45:28] number of text features: 1 \n",
"[11:45:28] number of categorical features: 0 \n",
"[11:45:28] number of continuous features: 0 \n"
@@ -625,7 +625,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[12:07:40] ===== Start working with \u001B[1mfold 2\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
"[12:07:40] ===== Start working with \u001b[1mfold 2\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
"[12:07:44] number of text features: 1 \n",
"[12:07:44] number of categorical features: 0 \n",
"[12:07:44] number of continuous features: 0 \n"
@@ -762,15 +762,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[12:30:22] Fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m finished. score = \u001B[1m0.9782371823652668\u001B[0m\n",
"[12:30:22] \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m fitting and predicting completed\n",
"[12:30:22] Fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m finished. score = \u001b[1m0.9782371823652668\u001b[0m\n",
"[12:30:22] \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m fitting and predicting completed\n",
"[12:30:22] Time left -472.15 secs\n",
"\n",
"[12:30:22] Time limit exceeded. Last level models will be blended and unused pipelines will be pruned.\n",
"\n",
"[12:30:22] \u001B[1mLayer 1 training completed.\u001B[0m\n",
"[12:30:22] \u001b[1mLayer 1 training completed.\u001b[0m\n",
"\n",
"[12:30:22] \u001B[1mAutoml preset training completed in 4072.15 seconds\u001B[0m\n",
"[12:30:22] \u001b[1mAutoml preset training completed in 4072.15 seconds\u001b[0m\n",
"\n",
"[12:30:22] Model description:\n",
"Final prediction for new objects (level 0) = \n",
@@ -1918,11 +1918,11 @@
"[12:38:00] - CPU: 1 cores\n",
"[12:38:00] - memory: 16 GB\n",
"\n",
"[12:38:00] \u001B[1mTrain data shape: (80000, 7)\u001B[0m\n",
"[12:38:00] \u001b[1mTrain data shape: (80000, 7)\u001b[0m\n",
"\n",
"[12:38:01] Layer \u001B[1m1\u001B[0m train process start. Time left 3599.63 secs\n",
"[12:38:01] Start fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m ...\n",
"[12:38:01] ===== Start working with \u001B[1mfold 0\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n"
"[12:38:01] Layer \u001b[1m1\u001b[0m train process start. Time left 3599.63 secs\n",
"[12:38:01] Start fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m ...\n",
"[12:38:01] ===== Start working with \u001b[1mfold 0\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n"
]
},
{
@@ -1942,7 +1942,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[13:07:23] ===== Start working with \u001B[1mfold 1\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n"
"[13:07:23] ===== Start working with \u001b[1mfold 1\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n"
]
},
{
@@ -1964,15 +1964,15 @@
"text": [
"[13:36:29] Time limit exceeded after calculating fold 1\n",
"\n",
"[13:36:29] Fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m finished. score = \u001B[1m-0.46728458911890136\u001B[0m\n",
"[13:36:29] \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m fitting and predicting completed\n",
"[13:36:29] Fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m finished. score = \u001b[1m-0.46728458911890136\u001b[0m\n",
"[13:36:29] \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m fitting and predicting completed\n",
"[13:36:29] Time left 91.29 secs\n",
"\n",
"[13:36:29] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.\n",
"\n",
"[13:36:29] \u001B[1mLayer 1 training completed.\u001B[0m\n",
"[13:36:29] \u001b[1mLayer 1 training completed.\u001b[0m\n",
"\n",
"[13:36:29] \u001B[1mAutoml preset training completed in 3508.71 seconds\u001B[0m\n",
"[13:36:29] \u001b[1mAutoml preset training completed in 3508.71 seconds\u001b[0m\n",
"\n",
"[13:36:29] Model description:\n",
"Final prediction for new objects (level 0) = \n",
@@ -2091,9 +2091,9 @@
"\n",
"0. The general idea of method is find the most informative subset of tokens with respect to target using [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information). The number of tokens in this subset is fixed and equals ```n_important```.\n",
"\n",
"1. There is may be some missunderstanding with tokenization that used inside models in automl and tokenization in this method. L2X has its own tokenization, so they are different. If it isn't set we infer it from default tokenization for language in ``text_params`` of ```TabularNLPAutoML```. Else you can set it with language: ``'ru'`` or ``'en'`` for russian and english languages, respectively. Also it can be scepcified as callable function that from string produces list of tokens.\n",
"1. There is may be some misunderstanding with tokenization that used inside models in automl and tokenization in this method. L2X has its own tokenization, so they are different. If it isn't set we infer it from default tokenization for language in ``text_params`` of ```TabularNLPAutoML```. Else you can set it with language: ``'ru'`` or ``'en'`` for russian and english languages, respectively. Also it can be scepcified as callable function that from string produces list of tokens.\n",
"\n",
"2. After tokenization sentence was presented as the matrix of embedding vectors (you can specify ``embedder`` or randomly initalized embeddings will be used). Not important vectors of this matrix will be masked (important tokens selected with Token Importance + Subset Sampler blocks), and the other use for model (Distil model), that tries to imitate the original automl model (learns to predict the same outputs).\n",
"2. After tokenization sentence was presented as the matrix of embedding vectors (you can specify ``embedder`` or randomly initialized embeddings will be used). Not important vectors of this matrix will be masked (important tokens selected with Token Importance + Subset Sampler blocks), and the other use for model (Distil model), that tries to imitate the original automl model (learns to predict the same outputs).\n",
"\n",
"3. Scheme of L2X:\n",
"\n",
@@ -2114,8 +2114,8 @@
" - ``train_batch_size`` - size of batch for training process;\n",
" - ``valid_batch_size`` - size of batch for validation process;\n",
" - ``temp_anneal_factor`` - annealing factor for temperature. The temperature will be multiplied by this coefficient every epoch.\n",
" - ``importance_sampler`` - specifices method of sampling importance (there are two of them ``'gumbeltopk'`` - method from the original paper, ``'softsub'`` - another method);\n",
" - `max_vocab_length` - maximum lenght of vocabular (vocabular build up from ``max_vocab_length`` the most frequent tokens). If ``max_vocab_length`` is ``-1`` then include all in train set.\n",
" - ``importance_sampler`` - specifies method of sampling importance (there are two of them ``'gumbeltopk'`` - method from the original paper, ``'softsub'`` - another method);\n",
" - `max_vocab_length` - maximum length of vocabular (vocabular build up from ``max_vocab_length`` the most frequent tokens). If ``max_vocab_length`` is ``-1`` then include all in train set.\n",
" - ``embedder`` - embedding dictionary or path to fasttext/dict of embeddings.\n",
" \n",
"5. Some links for more info about L2X:\n",
2 changes: 1 addition & 1 deletion examples/tutorials/Tutorial_6_custom_pipeline.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 6: Custom pipiline tutorial"
"# Tutorial 6: Custom pipeline tutorial"
]
},
{