Merge branch 'experiment/fix_codespell' into 'master'

fix codespell

See merge request ai-lab-pmo/mltools/automl/LightAutoML!18
sb-ai-lab · Oct 24, 2024 · a12f81e
2 parents 2591a23 + a88ed9a
commit a12f81e

Showing 66 changed files with 19,549 additions and 19,549 deletions.
diff --git a/docs/pages/Installation.rst b/docs/pages/Installation.rst
@@ -27,7 +27,7 @@ Then,
     # Create virtual environment inside your project directory
     poetry config virtualenvs.in-project true
 
-    # If you want to update dependecies, run the command:
+    # If you want to update dependencies, run the command:
     poetry lock
 
     # Installation

diff --git a/examples/data/jobs_train.csv b/examples/data/jobs_train.csv
diff --git a/examples/optimization/sequential_parameter_search.py b/examples/optimization/sequential_parameter_search.py
@@ -1,6 +1,6 @@
 # -*- encoding: utf-8 -*-
 
-"""Simple example for sequetial parameter search with OptunaTuner."""
+"""Simple example for sequential parameter search with OptunaTuner."""
 
 import copy
 

diff --git a/examples/tutorials/Tutorial_10_relational_data_with_star_scheme.ipynb b/examples/tutorials/Tutorial_10_relational_data_with_star_scheme.ipynb
diff --git a/examples/tutorials/Tutorial_11_time_series.ipynb b/examples/tutorials/Tutorial_11_time_series.ipynb
diff --git a/examples/tutorials/Tutorial_12_Matching.ipynb b/examples/tutorials/Tutorial_12_Matching.ipynb
@@ -1425,7 +1425,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 4. No replacemet matching:\n",
+    "## 4. No replacement matching:\n",
     "\n",
     "    In order you need to just match groups with no replacement you may use additional method match_no_rep()."
    ]
@@ -1458,7 +1458,7 @@
    "outputs": [],
    "source": [
     "# you may specify threshold in order to receive only pair with 5% difference in post_spends\n",
-    "no_replacemet_df = model.match_no_rep(threshold=0.05) "
+    "no_replacement_df = model.match_no_rep(threshold=0.05) "
    ]
   },
   {
@@ -1579,7 +1579,7 @@
     }
    ],
    "source": [
-    "no_replacemet_df.head()"
+    "no_replacement_df.head()"
    ]
   },
   {
@@ -1599,7 +1599,7 @@
     }
    ],
    "source": [
-    "no_replacemet_df.shape"
+    "no_replacement_df.shape"
    ]
   },
   {

diff --git a/examples/tutorials/Tutorial_13_ABtesting.ipynb b/examples/tutorials/Tutorial_13_ABtesting.ipynb
@@ -1106,7 +1106,7 @@
    "metadata": {},
    "source": [
     "To perform experiment that separates samples by groups `group_col` <br>\n",
-    "and if there is group that shold not be separated `quant_field` can be used"
+    "and if there is group that should not be separated `quant_field` can be used"
    ]
   },
   {
@@ -1733,7 +1733,7 @@
     "### 3.1 Full AB-test\n",
     "\n",
     "Full (basic) version of test includes calculation of all available metrics, which are: \"diff in means\", \"diff in diff\" and \"cuped\"<br>\n",
-    "Pay attention, that for \"cuped\" and \"diff in diff\" metrics requred target before pilot."
+    "Pay attention, that for \"cuped\" and \"diff in diff\" metrics required target before pilot."
    ]
   },
   {

diff --git a/examples/tutorials/Tutorial_1_basics.ipynb b/examples/tutorials/Tutorial_1_basics.ipynb
@@ -842,7 +842,7 @@
    "source": [
     "Note: only logloss loss is available for binary task and it is the default loss. Default metric for binary classification is ROC-AUC. See more info about available and default losses and metrics [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task). \n",
     "\n",
-    "**Depending on the task, you can and shold choose exactly those metrics and losses that you want and need to optimize.**"
+    "**Depending on the task, you can and should choose exactly those metrics and losses that you want and need to optimize.**"
    ]
   },
   {
@@ -994,7 +994,7 @@
     "\n",
     "**Role and types guessing**\n",
     "\n",
-    "Roles can be specified as a string or a specific class object, or defined automatically. For ```TabularAutoML``` preset ```'numeric'```, ```'datetime'``` and ```'category'``` roles can be automatically defined. There are two ways of role defining. **First** is very simple: check if the value can be converted to a date (```'datetime'```), otherwise check if it can be converted to a number (```'numeric'```), otherwise declare it a category (```'categorical'```). But this method may not work well on large data or when encoding categories with integers. The **second** method is based on statistics: the distributions of numerical features are considered, and how similar they are to the distributions of real or categorical value. Also different ways of feature encoding (as a number or as a category) are compared and based on normalized Gini index it is decided which encoding is better. For this case a set of specific rules is created, and if at least one of them is fullfilled, then the feature will be assigned to numerical, otherwise to categorical. This check can be enabled or disabled using the ```advanced_roles``` parameter. \n",
+    "Roles can be specified as a string or a specific class object, or defined automatically. For ```TabularAutoML``` preset ```'numeric'```, ```'datetime'``` and ```'category'``` roles can be automatically defined. There are two ways of role defining. **First** is very simple: check if the value can be converted to a date (```'datetime'```), otherwise check if it can be converted to a number (```'numeric'```), otherwise declare it a category (```'categorical'```). But this method may not work well on large data or when encoding categories with integers. The **second** method is based on statistics: the distributions of numerical features are considered, and how similar they are to the distributions of real or categorical value. Also different ways of feature encoding (as a number or as a category) are compared and based on normalized Gini index it is decided which encoding is better. For this case a set of specific rules is created, and if at least one of them is fulfilled, then the feature will be assigned to numerical, otherwise to categorical. This check can be enabled or disabled using the ```advanced_roles``` parameter. \n",
     "\n",
     "If roles are explicitly specified, automatic definition won't be applied to the specified dataset columns. In the case of specifying a role as an object of a certain class, through its arguments, it is possible to set the processing parameters in more detail.\n",
     " \n",
@@ -1571,7 +1571,7 @@
     "- Fast (`fast`) - this method uses feature importances from feature selector LGBM model inside LightAutoML. It works extremely fast and almost always (almost because of situations, when feature selection is turned off or selector was removed from the final models with all GBM models). There is no need to use new labelled data.\n",
     "- Accurate (`accurate`) - this method calculate *features permutation importances* for the whole LightAutoML model based on the **new labelled data**. It always works but can take a lot of time to finish (depending on the model structure, new labelled dataset size etc.).\n",
     "\n",
-    "In the cell below we will use `automl_rd.model` instead `automl_rd` because we want to take the importances from the model, not from the report. But **be carefull** - everything, which is calculated using `automl_rd.model` will not go to the report."
+    "In the cell below we will use `automl_rd.model` instead `automl_rd` because we want to take the importances from the model, not from the report. But **be careful** - everything, which is calculated using `automl_rd.model` will not go to the report."
    ]
   },
   {
@@ -4433,7 +4433,7 @@
    "id": "0b1f7773",
    "metadata": {},
    "source": [
-    "### Multi-label classifcation\n",
+    "### Multi-label classification\n",
     "\n",
     "Now let's consider multi-label classification task, here you will use the same dataset as in section above (Anuran Calls (MFCCs) Data Set ). Let's pick labels in each column:"
    ]

diff --git a/examples/tutorials/Tutorial_2_WhiteBox_AutoWoE.ipynb b/examples/tutorials/Tutorial_2_WhiteBox_AutoWoE.ipynb
@@ -756,7 +756,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### AutoWoE - simplier model"
+    "### AutoWoE - simpler model"
    ]
   },
   {

diff --git a/examples/tutorials/Tutorial_4_NLP_Interpretation.ipynb b/examples/tutorials/Tutorial_4_NLP_Interpretation.ipynb
@@ -12,7 +12,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<img src=\"../../imgs/LightAutoML_logo_big.png\" alt=\"LightAutoML logo\" style=\"width:100%;\"/>"
+    "<img src=\"../../docs/imgs/lightautoml_logo_color.png\" alt=\"LightAutoML logo\" style=\"width:100%;\"/>"
    ]
   },
   {
@@ -35,7 +35,7 @@
    "metadata": {},
    "source": [
     "The last years deep neural networks / gradient boosting / ensembles of models allow to improve the soulution quality of many application task in field of natural language processing (NLP). The indicators of this improvement describe the partial behavior of the model and can hide errors, for example, errors in the construction of the model, errors in data collection. All this can be critical in tasks related to the processing of medical, forensic, banking data.\n",
-    "In this tutorial we will check the NLP interpetation module of automl."
+    "In this tutorial we will check the NLP interpretation module of automl."
    ]
   },
   {
@@ -83,7 +83,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Dowload data"
+    "## Download data"
    ]
   },
   {
@@ -377,12 +377,12 @@
       "[11:22:30] - CPU: 1 cores\n",
       "[11:22:30] - memory: 16 GB\n",
       "\n",
-      "[11:22:30] \u001B[1mTrain data shape: (127656, 8)\u001B[0m\n",
+      "[11:22:30] \u001b[1mTrain data shape: (127656, 8)\u001b[0m\n",
       "\n",
-      "[11:22:30] Layer \u001B[1m1\u001B[0m train process start. Time left 3599.85 secs\n",
-      "[11:22:31] Start fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m ...\n",
+      "[11:22:30] Layer \u001b[1m1\u001b[0m train process start. Time left 3599.85 secs\n",
+      "[11:22:31] Start fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m ...\n",
       "[11:22:31] Training params: {'bs': 32, 'num_workers': 1, 'max_length': 128, 'opt_params': {'lr': 1e-05}, 'scheduler_params': {'patience': 5, 'factor': 0.5, 'verbose': True}, 'is_snap': False, 'snap_params': {'k': 1, 'early_stopping': True, 'patience': 1, 'swa': False}, 'init_bias': True, 'n_epochs': 7, 'input_bn': False, 'emb_dropout': 0.1, 'emb_ratio': 3, 'max_emb_size': 50, 'bert_name': 'prajjwal1/bert-tiny', 'pooling': 'cls', 'device': device(type='cuda', index=0), 'use_cont': True, 'use_cat': True, 'use_text': True, 'lang': 'en', 'deterministic': False, 'multigpu': False, 'random_state': 42, 'path_to_save': None, 'verbose_inside': None, 'verbose': 1, 'device_ids': None, 'n_out': 1, 'cat_features': [], 'cat_dims': [], 'cont_features': [], 'cont_dim': 0, 'text_features': ['concated__comment_text'], 'bias': array([[-2.24401446]])}\n",
-      "[11:22:31] ===== Start working with \u001B[1mfold 0\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
+      "[11:22:31] ===== Start working with \u001b[1mfold 0\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
       "[11:22:36] number of text features: 1 \n",
       "[11:22:36] number of categorical features: 0 \n",
       "[11:22:36] number of continuous features: 0 \n"
@@ -497,7 +497,7 @@
      "output_type": "stream",
      "text": [
       "[11:45:22] Epoch: 6, train loss: 0.09056100249290466, val loss: 0.10337436944246292, val metric: 0.9788043902058639\n",
-      "[11:45:23] ===== Start working with \u001B[1mfold 1\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
+      "[11:45:23] ===== Start working with \u001b[1mfold 1\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
       "[11:45:28] number of text features: 1 \n",
       "[11:45:28] number of categorical features: 0 \n",
       "[11:45:28] number of continuous features: 0 \n"
@@ -625,7 +625,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "[12:07:40] ===== Start working with \u001B[1mfold 2\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n",
+      "[12:07:40] ===== Start working with \u001b[1mfold 2\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n",
       "[12:07:44] number of text features: 1 \n",
       "[12:07:44] number of categorical features: 0 \n",
       "[12:07:44] number of continuous features: 0 \n"
@@ -762,15 +762,15 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "[12:30:22] Fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m finished. score = \u001B[1m0.9782371823652668\u001B[0m\n",
-      "[12:30:22] \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m fitting and predicting completed\n",
+      "[12:30:22] Fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m finished. score = \u001b[1m0.9782371823652668\u001b[0m\n",
+      "[12:30:22] \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m fitting and predicting completed\n",
       "[12:30:22] Time left -472.15 secs\n",
       "\n",
       "[12:30:22] Time limit exceeded. Last level models will be blended and unused pipelines will be pruned.\n",
       "\n",
-      "[12:30:22] \u001B[1mLayer 1 training completed.\u001B[0m\n",
+      "[12:30:22] \u001b[1mLayer 1 training completed.\u001b[0m\n",
       "\n",
-      "[12:30:22] \u001B[1mAutoml preset training completed in 4072.15 seconds\u001B[0m\n",
+      "[12:30:22] \u001b[1mAutoml preset training completed in 4072.15 seconds\u001b[0m\n",
       "\n",
       "[12:30:22] Model description:\n",
       "Final prediction for new objects (level 0) = \n",
@@ -1918,11 +1918,11 @@
       "[12:38:00] - CPU: 1 cores\n",
       "[12:38:00] - memory: 16 GB\n",
       "\n",
-      "[12:38:00] \u001B[1mTrain data shape: (80000, 7)\u001B[0m\n",
+      "[12:38:00] \u001b[1mTrain data shape: (80000, 7)\u001b[0m\n",
       "\n",
-      "[12:38:01] Layer \u001B[1m1\u001B[0m train process start. Time left 3599.63 secs\n",
-      "[12:38:01] Start fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m ...\n",
-      "[12:38:01] ===== Start working with \u001B[1mfold 0\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n"
+      "[12:38:01] Layer \u001b[1m1\u001b[0m train process start. Time left 3599.63 secs\n",
+      "[12:38:01] Start fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m ...\n",
+      "[12:38:01] ===== Start working with \u001b[1mfold 0\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n"
      ]
     },
     {
@@ -1942,7 +1942,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "[13:07:23] ===== Start working with \u001B[1mfold 1\u001B[0m for \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m =====\n"
+      "[13:07:23] ===== Start working with \u001b[1mfold 1\u001b[0m for \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m =====\n"
      ]
     },
     {
@@ -1964,15 +1964,15 @@
      "text": [
       "[13:36:29] Time limit exceeded after calculating fold 1\n",
       "\n",
-      "[13:36:29] Fitting \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m finished. score = \u001B[1m-0.46728458911890136\u001B[0m\n",
-      "[13:36:29] \u001B[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001B[0m fitting and predicting completed\n",
+      "[13:36:29] Fitting \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m finished. score = \u001b[1m-0.46728458911890136\u001b[0m\n",
+      "[13:36:29] \u001b[1mLvl_0_Pipe_0_Mod_0_TorchNN\u001b[0m fitting and predicting completed\n",
       "[13:36:29] Time left 91.29 secs\n",
       "\n",
       "[13:36:29] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.\n",
       "\n",
-      "[13:36:29] \u001B[1mLayer 1 training completed.\u001B[0m\n",
+      "[13:36:29] \u001b[1mLayer 1 training completed.\u001b[0m\n",
       "\n",
-      "[13:36:29] \u001B[1mAutoml preset training completed in 3508.71 seconds\u001B[0m\n",
+      "[13:36:29] \u001b[1mAutoml preset training completed in 3508.71 seconds\u001b[0m\n",
       "\n",
       "[13:36:29] Model description:\n",
       "Final prediction for new objects (level 0) = \n",
@@ -2091,9 +2091,9 @@
     "\n",
     "0. The general idea of method is find the most informative subset of tokens with respect to target using [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information). The number of tokens in this subset is fixed and equals ```n_important```.\n",
     "\n",
-    "1. There is may be some missunderstanding with tokenization that used inside models in automl and tokenization in this method. L2X has its own tokenization, so they are different. If it isn't set we infer it from default tokenization for language in ``text_params`` of ```TabularNLPAutoML```. Else you can set it with language: ``'ru'`` or ``'en'`` for russian and english languages, respectively. Also it can be scepcified as callable function that from string produces list of tokens.\n",
+    "1. There is may be some misunderstanding with tokenization that used inside models in automl and tokenization in this method. L2X has its own tokenization, so they are different. If it isn't set we infer it from default tokenization for language in ``text_params`` of ```TabularNLPAutoML```. Else you can set it with language: ``'ru'`` or ``'en'`` for russian and english languages, respectively. Also it can be scepcified as callable function that from string produces list of tokens.\n",
     "\n",
-    "2. After tokenization sentence was presented as the matrix of embedding vectors (you can specify ``embedder``  or randomly initalized embeddings will be used). Not important vectors of this matrix will be masked (important tokens selected with Token Importance + Subset Sampler blocks), and the other use for model (Distil model), that tries to imitate the original automl model (learns to predict the same outputs).\n",
+    "2. After tokenization sentence was presented as the matrix of embedding vectors (you can specify ``embedder``  or randomly initialized embeddings will be used). Not important vectors of this matrix will be masked (important tokens selected with Token Importance + Subset Sampler blocks), and the other use for model (Distil model), that tries to imitate the original automl model (learns to predict the same outputs).\n",
     "\n",
     "3. Scheme of L2X:\n",
     "\n",
@@ -2114,8 +2114,8 @@
     " - ``train_batch_size`` - size of batch for training process;\n",
     " - ``valid_batch_size`` - size of batch for validation process;\n",
     " - ``temp_anneal_factor`` - annealing factor for temperature. The temperature will be multiplied by this coefficient every epoch.\n",
-    " - ``importance_sampler`` - specifices method of sampling importance (there are two of them ``'gumbeltopk'`` - method from the original paper, ``'softsub'`` - another method);\n",
-    " - `max_vocab_length` - maximum lenght of vocabular (vocabular build up from ``max_vocab_length`` the most frequent tokens). If ``max_vocab_length`` is ``-1`` then include all in train set.\n",
+    " - ``importance_sampler`` - specifies method of sampling importance (there are two of them ``'gumbeltopk'`` - method from the original paper, ``'softsub'`` - another method);\n",
+    " - `max_vocab_length` - maximum length of vocabular (vocabular build up from ``max_vocab_length`` the most frequent tokens). If ``max_vocab_length`` is ``-1`` then include all in train set.\n",
     " - ``embedder`` - embedding dictionary or path to fasttext/dict of embeddings.\n",
     " \n",
     "5. Some links for more info about L2X:\n",

diff --git a/examples/tutorials/Tutorial_6_custom_pipeline.ipynb b/examples/tutorials/Tutorial_6_custom_pipeline.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Tutorial 6: Custom pipiline tutorial"
+    "# Tutorial 6: Custom pipeline tutorial"
    ]
   },
   {