Day 2/2 (#68)

* add notebooks for 2/2
* fix links
* add lgbm
* remove unnecessary files
* fix nb
* req
* add page
* projet
* instructions
* changes
* add rétro
* fix rst
sdpython · Feb 25, 2024 · b750121
1 parent a6115ca

Showing 10 changed files with 8,229 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,7 @@
 *.log
 *.dbf
 *.xlsx
+*.pickle
 .coverage
 data.*
 paris*.*
@@ -48,6 +49,7 @@ _doc/practice/algo-compose/paris_54000.*
 _doc/practice/algo-base/*.csv
 _doc/practice/algo-base/*.txt
 _doc/practice/algo-base/*.zip
+_doc/practice/ml/catboost_info/*
 _doc/practice/py-base/*.csv
 _doc/practice/py-base/*.json
 _doc/practice/py-base/*.jpg

diff --git a/_doc/articles/2024/2024-03-01-route2024.rst b/_doc/articles/2024/2024-03-01-route2024.rst
@@ -22,3 +22,127 @@ Séance 1 (26/1)
 * `ChatGPT <https://chat.openai.com/>`_,
   `LLM <https://en.wikipedia.org/wiki/Large_language_model>`_,
   (Large Language Model), SLLM (Small LLM)
+
+Session 2 (2/2)
+===============
+
+* regression trees, classification trees
+* random forest, boosting trees
+  (:epkg:`xgboost`, :epkg:`lightgtbm`, :epkg:`catboost`),
+  :ref:`RandomForest, Overfitting <nbl-practice-ml-ml_a_tree_overfitting>`
+* Gradient Boosting, :ref:`Gradient Boosting and Learning Rate with Random Forests <nbl-practice-ml-gradient_boosting>`
+* Linear regression with constraints on the coefficients,
+  `Ridge <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html>`_,
+  `Lasso <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html>`_,
+  `ElasticNet <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html>`_,
+  :ref:`Ridge, Lasso, mathematics <nbl-practice-ml-ridge_lasso>`
+* The :epkg:`pipeline` concept, or how to integrate preprocessing into the model
+* preprocessing: convert everything to numeric,
+  numerical, categorical and text data
+* a dataset:
+  `load_diabetes <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html>`_
+
+Session 3 (8/2)
+===============
+
+* The :epkg:`pipeline` concept, or how to integrate preprocessing into the model
+* preprocessing: convert everything to numeric,
+  numerical, categorical and text data
+
+Session 4 (16/2)
+================
+
+* writing your own estimator
+* grid search
+* handling missing values
+* missing values, gradient, ensemble methods
+* neural networks: the `backpropagation
+  <https://sdpython.github.io/doc/mlstatpy/dev/c_ml/rn/rn_5_newton.html#calcul-du-gradient-ou-retropropagation>`_ algorithm
+* maps with `geopandas <https://geopandas.org/en/stable/>`_
+* interpretability,
+  `"Why Should I Trust You?": Explaining the Predictions of Any Classifier
+  <https://arxiv.org/pdf/1602.04938v1.pdf>`_,
+  `LIME <https://ema.drwhy.ai/LIME.html>`_,
+  `SHAP <https://ema.drwhy.ai/shapley.html>`_,
+  `Partial Dependence Plot
+  <https://scikit-learn.org/stable/modules/partial_dependence.html>`_
+* ethical machine learning,
+  `Latanya Sweeney: How technology impacts humans and dictates our civic future
+  <https://www.youtube.com/watch?v=Buf0wLb86Lo>`_,
+  `Equality of Opportunity in Supervised Learning
+  <https://home.ttic.edu/~nati/Publications/HardtPriceSrebro2016.pdf>`_
+
+Session 5 (23/2)
+================
+
+* time series,
+  decomposition, `Holt Winters <https://otexts.com/fpp2/holt-winters.html>`_,
+  regime change detection,
+  `Kalman filter <http://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf>`_,
+  `SSA <https://en.wikipedia.org/wiki/Singular_spectrum_analysis>`_
+* packages `prophet <https://facebook.github.io/prophet/docs/quick_start.html>`_,
+  :epkg:`statsmodels`,
+  `ruptures <https://github.com/deepcharles/ruptures>`_,
+  `tslearn <https://github.com/tslearn-team/tslearn>`_
+* survival analysis
+* anomalies
+* recommendations,
+  `NMF <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html>`_
+* ranking
+* `TSNE <https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html>`_
+* pytorch
+* `skorch <https://github.com/skorch-dev/skorch>`_
+* :epkg:`statsmodels`
+
+Projects
+========
+
+Choose one of two subjects.
+
+**Subject 1**
+
+Write a notebook or a script that builds,
+for any binary classification problem,
+a first solution and first results.
+
+This notebook or script must automatically detect the numerical,
+categorical and text variables, apply
+the appropriate preprocessing, then fit a few models.
+
+The idea is to build a first baseline to find out whether the problem
+is more or less difficult. In particular, the accuracy can be
+compared to the proportion of each class.
+
+**Subject 2**
+
+Once a machine learning model has been trained, write a notebook
+or a script that gives, for each observation and each variable,
+the change to apply to that variable, leaving the others unchanged,
+to make the model switch to the other class.
+
+Suppose the model depends on two variables X1 and X2, where X1 is numerical
+and X2 categorical. The question is how to change
+X1 to change the model's output, or whether the model always predicts
+the same class whatever the category X2.
+
+The idea is to understand whether the model is locally sensitive to a
+variable.
+
+**Constraints**
+
+* A 20-minute oral presentation on April 5:
+  10 minutes of presentation, 10 minutes of questions
+* Submit the code on April 2 before midnight
+* Groups of 3
+* The script or notebook must include a pipeline, a unit test and a graph.
+* Each notebook must be evaluated on two datasets of your choice.
+
+The unit test is a function that checks that the notebook or script
+always returns the same result on a very simple dataset,
+because the expected result on this dataset is known in advance.
+
+For example, take two variables X1, X2 and a class to learn
+that equals 1 if X1 > 5 and 0 otherwise. The notebook for the first subject must
+report that the problem is easy and that the performance is 100%
+correct classification. The notebook for the second subject must
+report that the prediction does not depend on variable X2.
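
Note on the unit-test example that closes the project section (class equal to 1 when X1 > 5): a minimal, standard-library-only sketch of what such a test might check. Everything named here (`make_toy_dataset`, `stump_predict`, `depends_on_category`, `smallest_flip`, `test_toy_problem`) is hypothetical and not part of the commit; `stump_predict` only stands in for a trained model, whereas a real project would fit a scikit-learn pipeline instead.

```python
import random


def make_toy_dataset(n=200, seed=0):
    """Toy data from the example: class is 1 if X1 > 5, else 0; X2 is pure noise."""
    rng = random.Random(seed)
    X = [(rng.uniform(0, 10), rng.choice(["a", "b", "c"])) for _ in range(n)]
    y = [1 if x1 > 5 else 0 for x1, _ in X]
    return X, y


def stump_predict(row):
    """Hypothetical stand-in for a trained model (a single-split decision tree)."""
    x1, _x2 = row
    return 1 if x1 > 5 else 0


def accuracy(predict, X, y):
    """Share of observations that are correctly classified."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def majority_share(y):
    """Proportion of the most frequent class, the reference point for a baseline."""
    ones = sum(y) / len(y)
    return max(ones, 1 - ones)


def depends_on_category(predict, X, categories, index=1):
    """True if swapping the categorical feature ever changes a prediction."""
    for row in X:
        base = predict(row)
        for cat in categories:
            changed = list(row)
            changed[index] = cat
            if predict(tuple(changed)) != base:
                return True
    return False


def smallest_flip(predict, row, index=0, step=0.1, max_steps=200):
    """Smallest tested change of one numeric feature that flips the prediction.

    Only multiples of `step` are tried, in both directions, and every other
    feature is left unchanged. Returns None if no tested change flips the model.
    """
    base = predict(row)
    for k in range(1, max_steps + 1):
        for delta in (k * step, -k * step):
            changed = list(row)
            changed[index] = row[index] + delta
            if predict(tuple(changed)) != base:
                return delta
    return None


def test_toy_problem():
    """The kind of unit test the constraints ask for: the answers are known in advance."""
    X, y = make_toy_dataset()
    # Subject 1: the problem is easy, accuracy is 100% and beats the majority class.
    assert accuracy(stump_predict, X, y) == 1.0
    assert majority_share(y) < 1.0
    # Subject 2: the prediction never depends on X2, and X1 must cross 5 to flip.
    assert not depends_on_category(stump_predict, X, ["a", "b", "c"])
    assert smallest_flip(stump_predict, (4.0, "a")) > 1.0
    assert smallest_flip(stump_predict, (6.0, "a")) < 0.0
```

Because the toy labels are a deterministic function of X1, any deviation from these results points at a bug in the tooling rather than at noise in the data.
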
diff --git a/_doc/conf.py b/_doc/conf.py
@@ -262,6 +262,7 @@
     "PiecewiseTreeRegressor": "https://sdpython.github.io/doc/mlinsights/dev/api/mlmodel_tree.html#piecewisetreeregressor",
     "Pillow": "https://pillow.readthedocs.io/en/stable/",
     "pip": "https://pip.pypa.io/en/stable/",
+    "pipeline": "https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html",
     "Predictable t-SNE": "https://sdpython.github.io/doc/mlinsights/dev/auto_examples/plot_predictable_tsne.html",
     "printf-style String Formatting": "https://docs.python.org/3/library/stdtypes.html#old-string-formatting",
     "programmation impérative": "https://fr.wikipedia.org/wiki/Programmation_imp%C3%A9rative",
@@ -387,12 +388,14 @@
 epkg_dictionary.update(
     {
         "cartopy": "https://scitools.org.uk/cartopy/docs/latest/",
+        "catboost": "https://catboost.ai/",
         "csv": "https://fr.wikipedia.org/wiki/Comma-separated_values",
         "Enedis": "https://data.enedis.fr/",
         "fonction": "https://fr.wikipedia.org/wiki/Fonction_(math%C3%A9matiques)",
         "fonction continue": "https://fr.wikipedia.org/wiki/Continuit%C3%A9_(math%C3%A9matiques)",
         "fortran": "https://en.wikipedia.org/wiki/Fortran",
         "GEOFLA": "https://www.data.gouv.fr/en/datasets/geofla-r/",
+        "lightgtbm": "https://lightgbm.readthedocs.io/en/stable/",
         "machine learning": "https://en.wikipedia.org/wiki/Machine_learning",
         "matrice de confusion": "https://fr.wikipedia.org/wiki/Matrice_de_confusion",
         "nuage de points": "https://fr.wikipedia.org/wiki/Nuage_de_points_(statistique)",
@@ -403,6 +406,7 @@
         "UCI": "https://archive.ics.uci.edu/datasets",
         "variable aléatoire": "https://fr.wikipedia.org/wiki/Variable_al%C3%A9atoire",
         "voyageur de commerce": "https://fr.wikipedia.org/wiki/Probl%C3%A8me_du_voyageur_de_commerce",
+        "xgboost": "https://xgboost.readthedocs.io/en/stable/",
     }
 )
 

diff --git a/_doc/notebook_gallery.rst b/_doc/notebook_gallery.rst
@@ -178,3 +178,5 @@ Machine Learning
     practice/ml/ml_features_model
     practice/ml/timeseries_ssa
     practice/ml/ml_a_tree_overfitting
+    practice/ml/gradient_boosting
+    practice/ml/ridge_lasso
diff --git a/_doc/practice/index_ml.rst b/_doc/practice/index_ml.rst
@@ -20,6 +20,8 @@ Machine Learning
     ml/winesr_knn_cross_val
     ml/winesr_knn_hyper
     ml/ml_a_tree_overfitting
+    ml/gradient_boosting
+    ml/ridge_lasso
 
 .. toctree::
     :maxdepth: 1

diff --git a/_doc/practice/ml/gradient_boosting.ipynb b/_doc/practice/ml/gradient_boosting.ipynb
diff --git a/_doc/practice/ml/ridge_lasso.ipynb b/_doc/practice/ml/ridge_lasso.ipynb
diff --git a/_unittests/ut_xrun_doc/test_documentation_notebook.py b/_unittests/ut_xrun_doc/test_documentation_notebook.py
@@ -119,6 +119,16 @@ def _test_(self, fullname=fullname):
                         res = self.run_test(fullname, verbose=VERBOSE)
                         self.assertIn(res, (-1, 1))
 
+                elif (
+                    "ml_a_tree_overfitting" in name
+                    and os.environ.get("CIRCLECI", "undefined") != "undefined"
+                ):
+
+                    @unittest.skip("issues with circleci")
+                    def _test_(self, fullname=fullname):
+                        res = self.run_test(fullname, verbose=VERBOSE)
+                        self.assertIn(res, (-1, 1))
+
                 else:
 
                     def _test_(self, fullname=fullname):
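
Note on the CircleCI branch added above: the same intent (skip one notebook test only when the `CIRCLECI` environment variable is set) could also be expressed with `unittest.skipIf`. The sketch below is only an illustration under that assumption; `running_on_circleci` and `TestNotebookExample` are hypothetical names, and the test body is a placeholder for the real `run_test` call.

```python
import os
import unittest


def running_on_circleci(environ=None):
    """Same check as in the diff: the variable is set to anything but 'undefined'."""
    environ = os.environ if environ is None else environ
    return environ.get("CIRCLECI", "undefined") != "undefined"


class TestNotebookExample(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipIf(running_on_circleci(), "issues with circleci")
    def test_ml_a_tree_overfitting(self):
        res = 1  # placeholder for self.run_test(fullname, verbose=VERBOSE)
        self.assertIn(res, (-1, 1))
```

The commit instead defines a separate `_test_` function decorated with `unittest.skip`, which fits the existing loop that generates one test method per notebook.
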

diff --git a/pyproject.toml b/pyproject.toml
@@ -132,11 +132,11 @@ exclude = [
 # Same as Black.
 line-length = 88
 
-[tool.ruff.mccabe]
+[tool.ruff.lint.mccabe]
 # Unlike Flake8, default to a complexity level of 10.
 max-complexity = 10
 
-[tool.ruff.per-file-ignores]
+[tool.ruff.lint.per-file-ignores]
 "_doc/conf.py" = ["F821", "E501"]
 "teachpyx/__init__.py" = ["E501"]
 "teachpyx/datasets/__init__.py" = ["F401"]

diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -2,6 +2,7 @@ black
 black-nb
 blockdiag
 cartopy
+catboost
 category-encoders
 chardet
 cloudpickle
@@ -16,6 +17,7 @@ ipython
 jinja2
 jupyter
 lifelines
+lightgbm
 lxml
 matplotlib
 mutagen  # mp3