diff --git a/docs/hero.css b/docs/hero.css
new file mode 100644
index 0000000..2c5a77c
--- /dev/null
+++ b/docs/hero.css
@@ -0,0 +1,3 @@
+.column-page {
+ padding-top: 0%;
+}
\ No newline at end of file
diff --git a/docs/hero.html b/docs/hero.html
index b9a022a..a371b2d 100644
--- a/docs/hero.html
+++ b/docs/hero.html
@@ -128,17 +128,9 @@
-
-
-
-
-
Taija
-
-
Trustworthy Artificial Intelligence in Julia
-
Taija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia.
-
-
-
+
+
+
@@ -194,7 +186,18 @@
+
Make sense of your AI models
+
Artificial Intelligence (AI) has been advancing rapidly in recent years. Consequently, Julia’s AI ecosystem has also been growing fast. Taija is an effort to provide users with tools to make sense of the AI models that they train and deploy. Some highlights include:
Taija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia.
-
+
+
+
+
@@ -191,7 +192,18 @@
Trustworthy Ar
-
+
+
Make sense of your AI models
+
Artificial Intelligence (AI) has been advancing rapidly in recent years. Consequently, Julia’s AI ecosystem has also been growing fast. Taija is an effort to provide users with tools to make sense of the AI models that they train and deploy. Some highlights include:
Taija is a community effort largely maintained by academics and students at TU Delft. We welcome contributions of any kind.
+
diff --git a/docs/search.json b/docs/search.json
index 3d8f759..d264d95 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -1,206 +1,262 @@
[
{
- "objectID": "blog/index.html",
- "href": "blog/index.html",
- "title": "Posts",
+ "objectID": "welcome.html",
+ "href": "welcome.html",
+ "title": "Welcome to Taija",
"section": "",
- "text": "Building a Conformal Chatbot in Julia\n\n\nHuggingFace, Transformers, and Conformal Prediction - Part 1\n\n\nFor this year’s edition of the ING Analytics Experiment Week, we put ConformalPrediction.jl to work and built a chatbot that can be used for Conformal Intent Recognition.\n\n\n\n\n\nJul 5, 2023\n\n\nPatrick Altmeyer\n\n\n7 min\n\n\n8/28/24, 5:16:41 PM\n\n\n\n\n\n\n\n\n\n\n\n\nPaving the Way Towards Low-Overhead Uncertainty Calibration\n\n\nAn Accessible Intro to Laplace Approximations in Julia for Bayesian Deep Learning\n\n\nA guest blog post by a team of students from TU Delft, who have contributed multiple improvements to LaplaceRedux.jl.\n\n\n\n\n\nJul 4, 2023\n\n\nPatrick Altmeyer, Severin Bratus, Mark Ardman, Adelina Cazacu, Andrei Ionescu, Ivan Makarov\n\n\n11 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nPrediction Intervals for any Regression Model\n\n\nConformal Prediction in Julia — Part 3\n\n\nThis third post introduces conformal regression by going through a standard machine learning workflow using MLJ.jl and ConformalPrediction.jl.\n\n\n\n\n\nDec 12, 2022\n\n\nPatrick Altmeyer\n\n\n11 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nHow to Conformalize a Deep Image Classifier\n\n\nConformal Prediction in Julia — Part 2\n\n\nA guide demonstrating how to use ConformalPrediction.jl to conformalize a deep image classifier in a few lines of code.\n\n\n\n\n\nDec 5, 2022\n\n\nPatrick Altmeyer\n\n\n9 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nConformal Prediction in Julia 🟣🔴🟢\n\n\nConformal Prediction in Julia — Part 1\n\n\nA (very) gentle introduction to Conformal Prediction in Julia using my new package ConformalPrediction.jl.\n\n\n\n\n\nOct 25, 2022\n\n\nPatrick Altmeyer\n\n\n15 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nA new tool for explainable AI\n\n\nCounterfactual Explanations in Julia — Part I\n\n\nThis post introduces a new Julia package for generating counterfactual explanations. 
The package can be used to explain machine learning algorithms developed and trained in Julia as well as other popular programming languages like Python and R. \n\n\n\n\n\nApr 20, 2022\n\n\nPatrick Altmeyer\n\n\n12 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nGo deep, but also … go Bayesian!\n\n\nEffortless Bayesian Deep Learning in Julia — Part I\n\n\nAn introduction to effortless Bayesian deep learning through Laplace approximation coded from scratch in Julia.\n\n\n\n\n\nFeb 18, 2022\n\n\nPatrick Altmeyer\n\n\n12 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\nNo matching items"
+ "text": "Taija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia."
},
{
- "objectID": "blog/posts/conformal-llm/index.html",
- "href": "blog/posts/conformal-llm/index.html",
- "title": "Building a Conformal Chatbot in Julia",
+ "objectID": "welcome.html#trustworthy-artificial-intelligence-in-julia",
+ "href": "welcome.html#trustworthy-artificial-intelligence-in-julia",
+ "title": "Welcome to Taija",
"section": "",
- "text": "Short demo of our conformal chatbot.\nLarge Language Models are all the buzz right now. They are used for a variety of tasks, including text classification, question answering, and text generation. In this tutorial, we will show how to conformalize a transformer language model for text classification. We will use the Banking77 dataset (Casanueva et al. 2020), which consists of 13,083 queries from 77 intents. On the model side, we will use the DistilRoBERTa model, which is a distilled version of RoBERTa (Liu et al. 2019) finetuned on the Banking77 dataset."
+ "text": "Taija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia."
},
{
- "objectID": "blog/posts/conformal-llm/index.html#huggingface-model",
- "href": "blog/posts/conformal-llm/index.html#huggingface-model",
- "title": "Building a Conformal Chatbot in Julia",
- "section": "🤗 HuggingFace Model",
- "text": "🤗 HuggingFace Model\nThe model can be loaded from HF straight into our running Julia session using the Transformers.jl package. Below we load the tokenizer tkr and the model mod. The tokenizer is used to convert the text into a sequence of integers, which is then fed into the model. The model outputs a hidden state, which is then fed into a classifier to get the logits for each class. Finally, the logits are then passed through a softmax function to get the corresponding predicted probabilities. Below we run a few queries through the model to see how it performs.\n\n\nCode\n# Load model from HF 🤗:\ntkr = hgf\"mrm8488/distilroberta-finetuned-banking77:tokenizer\"\nmod = hgf\"mrm8488/distilroberta-finetuned-banking77:ForSequenceClassification\"\n\n# Test model:\nquery = [\n \"What is the base of the exchange rates?\",\n \"Why is my card not working?\",\n \"My Apple Pay is not working, what should I do?\",\n]\na = encode(tkr, query)\nb = mod.model(a)\nc = mod.cls(b.hidden_state)\nd = softmax(c.logit)\n[labels[i] for i in Flux.onecold(d)]\n\n\n3-element Vector{String}:\n \"exchange_rate\"\n \"card_not_working\"\n \"apple_pay_or_google_pay\""
+ "objectID": "blog/posts/conformal-regression/index.html",
+ "href": "blog/posts/conformal-regression/index.html",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "",
+ "text": "Conformal Prediction intervals for different coverage rates. As coverage grows, so does the width of the prediction interval.\nThis is the third (and for now final) part of a series of posts that introduce Conformal Prediction in Julia using ConformalPrediction.jl. The first post introduced Conformal Prediction for supervised classification tasks: we learned that conformal classifiers produce set-valued predictions that are guaranteed to include the true label of a new sample with a certain probability. In the second post we applied these ideas to a more hands-on example: we saw how easy it is to use ConformalPrediction.jl to conformalize a Deep Learning image classifier.\nIn this post, we will look at regression models instead, that is supervised learning tasks involving a continuous outcome variable. Regression tasks are as ubiquitous as classification tasks. For example, we might be interested in using a machine learning model to predict house prices or the inflation rate of the Euro or the parameter size of the next large language model. In fact, many readers may be more familiar with regression models than classification, in which case it may also be easier for you to understand Conformal Prediction (CP) in this context."
},
{
- "objectID": "blog/posts/conformal-llm/index.html#mlj-interface",
- "href": "blog/posts/conformal-llm/index.html#mlj-interface",
- "title": "Building a Conformal Chatbot in Julia",
- "section": "🔁 MLJ Interface",
- "text": "🔁 MLJ Interface\nSince our package is interfaced to MLJ.jl, we need to define a wrapper model that conforms to the MLJ interface. In order to add the model for general use, we would probably go through MLJFlux.jl, but for this tutorial, we will make our life easy and simply overload the MLJBase.fit and MLJBase.predict methods. Since the model from HF is already pre-trained and we are not interested in further fine-tuning, we will simply return the model object in the MLJBase.fit method. The MLJBase.predict method will then take the model object and the query and return the predicted probabilities. We also need to define the MLJBase.target_scitype and MLJBase.predict_mode methods. The former tells MLJ what the output type of the model is, and the latter can be used to retrieve the label with the highest predicted probability.\n\n\nCode\nstruct IntentClassifier <: MLJBase.Probabilistic\n tkr::TextEncoders.AbstractTransformerTextEncoder\n mod::HuggingFace.HGFRobertaForSequenceClassification\nend\n\nfunction IntentClassifier(;\n tokenizer::TextEncoders.AbstractTransformerTextEncoder, \n model::HuggingFace.HGFRobertaForSequenceClassification,\n)\n IntentClassifier(tkr, mod)\nend\n\nfunction get_hidden_state(clf::IntentClassifier, query::Union{AbstractString, Vector{<:AbstractString}})\n token = encode(clf.tkr, query)\n hidden_state = clf.mod.model(token).hidden_state\n return hidden_state\nend\n\n# This doesn't actually retrain the model, but it retrieves the classifier object\nfunction MLJBase.fit(clf::IntentClassifier, verbosity, X, y)\n cache=nothing\n report=nothing\n fitresult = (clf = clf.mod.cls, labels = levels(y))\n return fitresult, cache, report\nend\n\nfunction MLJBase.predict(clf::IntentClassifier, fitresult, Xnew)\n output = fitresult.clf(get_hidden_state(clf, Xnew))\n p̂ = UnivariateFinite(fitresult.labels,softmax(output.logit)',pool=missing)\n return p̂\nend\n\nMLJBase.target_scitype(clf::IntentClassifier) = 
AbstractVector{<:Finite}\n\nMLJBase.predict_mode(clf::IntentClassifier, fitresult, Xnew) = mode.(MLJBase.predict(clf, fitresult, Xnew))\n\n\nTo test that everything is working as expected, we fit the model and generated predictions for a subset of the test data:\n\n\nCode\nclf = IntentClassifier(tkr, mod)\ntop_n = 10\nfitresult, _, _ = MLJBase.fit(clf, 1, nothing, y_test[1:top_n])\n@time ŷ = MLJBase.predict(clf, fitresult, queries_test[1:top_n]);\n\n\n 1.923436 seconds (8.61 M allocations: 631.348 MiB, 2.99% gc time, 84.31% compilation time)"
+ "objectID": "blog/posts/conformal-regression/index.html#background",
+ "href": "blog/posts/conformal-regression/index.html#background",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "📖 Background",
+ "text": "📖 Background\nBefore we start, let’s briefly recap what CP is all about. Don’t worry, we’re not about to deep-dive into methodology. But just to give you a high-level description upfront:\n\nConformal prediction (a.k.a. conformal inference) is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.\n— Angelopoulos and Bates (2022) (arXiv)\n\nIntuitively, CP works under the premise of turning heuristic notions of uncertainty into rigorous uncertainty estimates through repeated sampling or the use of dedicated calibration data.\nIn what follows we will explore what CP can do by going through a standard machine learning workflow using MLJ.jl and ConformalPrediction.jl. There will be less focus on how exactly CP works, but references will point you to additional resources.\n\n\n\n\n\n\nInteractive Version\n\n\n\nThis post is also available as a fully interactive Pluto.jl 🎈 notebook hosted on binder: \nIn my own experience, this may take some time to load, certainly long enough to get yourself a hot beverage ☕ or first read on here. But I promise you that the wait is worth it!"
},
{
- "objectID": "blog/posts/conformal-llm/index.html#conformal-chatbot",
- "href": "blog/posts/conformal-llm/index.html#conformal-chatbot",
- "title": "Building a Conformal Chatbot in Julia",
- "section": "🤖 Conformal Chatbot",
- "text": "🤖 Conformal Chatbot\nTo turn the wrapped, pre-trained model into a conformal intent classifier, we can now rely on standard API calls. We first wrap our atomic model where we also specify the desired coverage rate and method. Since even simple forward passes are computationally expensive for our (small) LLM, we rely on Simple Inductive Conformal Classification.\nconf_model = conformal_model(clf; coverage=0.99, method=:simple_inductive, train_ratio=train_ratio)\nmach = machine(conf_model, queries, y)\n@time fit!(mach)\nSerialization.serialize(\"dev/private/simple_inductive.jls\", mach)\nFinally, we use our conformal LLM to build a simple yet powerful chatbot that runs directly in the Julia REPL. Without dwelling on the details too much, the conformal_chatbot works as follows:\n\nPrompt user to explain their intent.\nFeed user input through conformal LLM and present the output to the user.\nIf the conformal prediction set includes more than one label, prompt the user to either refine their input or choose one of the options included in the set.\n\n\n\nCode\nmach = Serialization.deserialize(\"../dev/private/simple_inductive.jls\")\n\nfunction prediction_set(mach, query::String)\n p̂ = MLJBase.predict(mach, query)[1]\n probs = pdf.(p̂, collect(1:77))\n in_set = findall(probs .!= 0)\n labels_in_set = labels[in_set]\n probs_in_set = probs[in_set]\n _order = sortperm(-probs_in_set)\n plt = UnicodePlots.barplot(labels_in_set[_order], probs_in_set[_order], title=\"Possible Intents\")\n return labels_in_set, plt\nend\n\nfunction conformal_chatbot()\n println(\"👋 Hi, I'm a Julia, your conformal chatbot. I'm here to help you with your banking query. Ask me anything or type 'exit' to exit ...\\n\")\n completed = false\n queries = \"\"\n while !completed\n query = readline()\n queries = queries * \",\" * query\n labels, plt = prediction_set(mach, queries)\n if length(labels) > 1\n println(\"🤔 Hmmm ... I can think of several options here. 
If any of these applies, simply type the corresponding number (e.g. '1' for the first option). Otherwise, can you refine your question, please?\\n\")\n println(plt)\n else\n println(\"🥳 I think you mean $(labels[1]). Correct?\")\n end\n\n # Exit:\n if query == \"exit\"\n println(\"👋 Bye!\")\n break\n end\n if query ∈ string.(collect(1:77))\n println(\"👍 Great! You've chosen '$(labels[parse(Int64, query)])'. I'm glad I could help you. Have a nice day!\")\n completed = true\n end\n end\nend\n\n\nBelow we show the output for two example queries. The first one is very ambiguous. As expected, the size of the prediction set is therefore large.\n\n\nCode\nambiguous_query = \"transfer mondey?\"\nprediction_set(mach, ambiguous_query)[2]\n\n\n\n Possible Intents \n ┌ ┐ \n beneficiary_not_allowed ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.150517 \n balance_not_updated_after_bank_transfer ┤■■■■■■■■■■■■■■■■■■■■■■ 0.111409 \n transfer_into_account ┤■■■■■■■■■■■■■■■■■■■ 0.0939535 \n transfer_not_received_by_recipient ┤■■■■■■■■■■■■■■■■■■ 0.091163 \n top_up_by_bank_transfer_charge ┤■■■■■■■■■■■■■■■■■■ 0.0893061 \n failed_transfer ┤■■■■■■■■■■■■■■■■■■ 0.0888321 \n transfer_timing ┤■■■■■■■■■■■■■ 0.0641954 \n transfer_fee_charged ┤■■■■■■■ 0.0361131 \n pending_transfer ┤■■■■■ 0.0270795 \n receiving_money ┤■■■■■ 0.0252126 \n └ ┘ \n\n\n\nThe more refined version of the prompt yields a smaller prediction set: less ambiguous prompts result in lower predictive uncertainty.\n\n\nCode\nrefined_query = \"I tried to transfer money to my friend, but it failed.\"\nprediction_set(mach, refined_query)[2]\n\n\n\n Possible Intents \n ┌ ┐ \n failed_transfer ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.59042 \n beneficiary_not_allowed ┤■■■■■■■ 0.139806 \n transfer_not_received_by_recipient ┤■■ 0.0449784 \n balance_not_updated_after_bank_transfer ┤■■ 0.037894 \n └ ┘ \n\n\n\nBelow we include a short demo video that shows the REPL-based chatbot in action."
+ "objectID": "blog/posts/conformal-regression/index.html#data",
+ "href": "blog/posts/conformal-regression/index.html#data",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "📈 Data",
+ "text": "📈 Data\nMost machine learning workflows start with data. For illustrative purposes we will work with synthetic data. The helper function below can be used to generate some regression data.\n\n\nCode\nfunction get_data(;N=1000, xmax=3.0, noise=0.5, fun::Function=fun(X) = X * sin(X))\n # Inputs:\n d = Distributions.Uniform(-xmax, xmax)\n X = rand(d, N)\n X = MLJBase.table(reshape(X, :, 1))\n\n # Outputs:\n ε = randn(N) .* noise\n y = @.(fun(X.x1)) + ε\n y = vec(y)\n return X, y\nend\n\n\nFigure 1 illustrates our observations (dots) along with the ground-truth mapping from inputs to outputs (line). We have defined that mapping \\(f: \\mathcal{X} \\mapsto \\mathcal{Y}\\) as follows:\n\n\nCode\nf(X) = X * cos(X)\n\n\nFigure 1: Some synthetic regression data. Observations are shown as dots. The ground-truth mapping from inputs to outputs is shown as a dashed line."
},
{
- "objectID": "blog/posts/conformal-llm/index.html#wrapping-up",
- "href": "blog/posts/conformal-llm/index.html#wrapping-up",
- "title": "Building a Conformal Chatbot in Julia",
- "section": "🌯 Wrapping Up",
- "text": "🌯 Wrapping Up\nThis work was done in collaboration with colleagues at ING as part of the ING Analytics 2023 Experiment Week. Our team demonstrated that Conformal Prediction provides a powerful and principled alternative to top-K intent classification. We won the first prize by popular vote.\nThere are a lot of things that can be improved. As far as LLMs are concerned, we have of course used a fairly small model here. In terms of Conformal Prediction, we have relied on simple inductive conformal classification. This is a good starting point, but there are more advanced methods available (and implemented in the package). Another thing we did not take into consideration here is that we have many outcome classes and may in practice be interested in achieving class-conditional coverage. Stay tuned for more!"
+ "objectID": "blog/posts/conformal-regression/index.html#model-training-using-mlj",
+ "href": "blog/posts/conformal-regression/index.html#model-training-using-mlj",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "🏋️ Model Training using MLJ",
+ "text": "🏋️ Model Training using MLJ\nConformalPrediction.jl is interfaced to MLJ.jl (Blaom et al. 2020): a comprehensive Machine Learning Framework for Julia. MLJ.jl provides a large and growing suite of popular machine learning models that can be used for supervised and unsupervised tasks. Conformal Prediction is a model-agnostic approach to uncertainty quantification, so it can be applied to any common supervised machine learning model.\nThe interface to MLJ.jl therefore seems natural: any (supervised) MLJ.jl model can now be conformalized using ConformalPrediction.jl. By leveraging existing MLJ.jl functionality for common tasks like training, prediction and model evaluation, this package is light-weight and scalable. Now let’s see how all of that works …\nTo start with, let’s split our data into a training and test set:\n\n\nCode\ntrain, test = partition(eachindex(y), 0.4, 0.4, shuffle=true)\n\n\nNow let’s define a model for our regression task:\n\n\nCode\nModel = @load KNNRegressor pkg = NearestNeighborModels\nmodel = Model()\n\n\n\n\n\n\n\n\nHave it your way!\n\n\n\nThink this dataset is too simple? Wondering why on earth I’m not using XGBoost for this task? In the interactive version of this post you have full control over the data and the model. Try it out!\n\n\nUsing standard MLJ.jl workflows let us now first train the unconformalized model. 
We first wrap our model in data:\n\n\nCode\nmach_raw = machine(model, X, y)\n\n\nThen we fit the machine to the training data:\n\n\nCode\nMLJBase.fit!(mach_raw, rows=train, verbosity=0)\n\n\nFigure 2 below shows the resulting point predictions for the test data set:\n\n\nFigure 2: Point predictions for our machine learning model.\n\n\nHow is our model doing? It’s never quite right, of course, since predictions are estimates and therefore uncertain. Let’s see how we can use Conformal Prediction to express that uncertainty."
},
{
- "objectID": "blog/posts/conformal-prediction/index.html",
- "href": "blog/posts/conformal-prediction/index.html",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "objectID": "blog/posts/conformal-regression/index.html#conformalizing-the-model",
+ "href": "blog/posts/conformal-regression/index.html#conformalizing-the-model",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "🔥 Conformalizing the Model",
+ "text": "🔥 Conformalizing the Model\nWe can turn our model into a conformalized model in just one line of code:\n\n\nCode\nconf_model = conformal_model(model)\n\n\nBy default conformal_model creates an Inductive Conformal Regressor (more on this below) when called on a <:Deterministic model. This behaviour can be changed by using the optional method key argument.\nTo train our conformal model we can once again rely on standard MLJ.jl workflows. We first wrap our model in data:\n\n\nCode\nmach = machine(conf_model, X, y)\n\n\nThen we fit the machine to the data:\n\n\nCode\nMLJBase.fit!(mach, rows=train, verbosity=0)\n\n\nNow let us look at the predictions for our test data again. The chart below shows the results for our conformalized model. Predictions from conformal regressors are range-valued: for each new sample the model returns an interval \\((y_{\\text{lb}},y_{\\text{ub}})\\in\\mathcal{Y}\\) that covers the test sample with a user-specified probability \\((1-\\alpha)\\), where \\(\\alpha\\) is the expected error rate. 
This is known as the marginal coverage guarantee and it is proven to hold under the assumption that training and test data are exchangeable.\n\n\nFigure 3: Prediction intervals for our conformalized machine learning model.\n\n\nIntuitively, a higher coverage rate leads to larger prediction intervals: since a larger interval covers a larger subspace of \\(\\mathcal{Y}\\), it is more likely to cover the true value.\nI don’t expect you to believe me that the marginal coverage property really holds. In fact, I couldn’t believe it myself when I first learned about it. If you like mathematical proofs, you can find one in this tutorial, for example. If you like convincing yourself through empirical observations, read on below …"
+ },
+ {
+ "objectID": "blog/posts/conformal-regression/index.html#evaluation",
+ "href": "blog/posts/conformal-regression/index.html#evaluation",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "🧐 Evaluation",
+ "text": "🧐 Evaluation\nTo verify the marginal coverage property empirically we can look at the empirical coverage rate of our conformal predictor (see Section 3 of the tutorial for details). To this end our package provides a custom performance measure emp_coverage that is compatible with MLJ.jl model evaluation workflows. In particular, we will call evaluate! on our conformal model using emp_coverage as our performance metric. The resulting empirical coverage rate should then be close to the desired level of coverage.\n\n\nCode\nmodel_evaluation =\n evaluate!(_mach, operation=MLJBase.predict, measure=emp_coverage, verbosity=0)\nprintln(\"Empirical coverage: $(round(model_evaluation.measurement[1], digits=3))\")\nprintln(\"Coverage per fold: $(round.(model_evaluation.per_fold[1], digits=3))\")\n\n\nEmpirical coverage: 0.909\nCoverage per fold: [0.94, 0.928, 0.892, 0.874, 0.898, 0.922]\n\n\n\n\n\n✅ ✅ ✅ Great! We got an empirical coverage rate that is slightly higher than desired 😁 … but why isn’t it exactly the same?\n\nIn most cases it will be slightly higher than desired, since \\((1-\\alpha)\\) is a lower bound. But note that it can also be slightly lower than desired. That is because the coverage property is “marginal” in the sense that the probability is averaged over the randomness in the data. For most purposes a large enough calibration set size (\\(n>1000\\)) mitigates that randomness enough. 
Depending on your choices above, the calibration set may be quite small (set to 500), which can lead to coverage slack (see Section 3 in the tutorial).\n\n\n\nSo what’s happening under the hood?\nInductive Conformal Prediction (also referred to as Split Conformal Prediction) broadly speaking works as follows:\n\nPartition the training data into a proper training set and a separate calibration set.\nTrain the machine learning model on the proper training set.\nUsing some heuristic notion of uncertainty (e.g., absolute error in the regression case), compute nonconformity scores using the calibration data and the fitted model.\nFor the given coverage ratio compute the corresponding quantile of the empirical distribution of nonconformity scores.\nFor the given quantile and test sample \\(X_{\\text{test}}\\), form the corresponding conformal prediction set like so: \\(C(X_{\\text{test}})=\\{y:s(X_{\\text{test}},y) \\le \\hat{q}\\}\\)"
+ },
+ {
+ "objectID": "blog/posts/conformal-regression/index.html#recap",
+ "href": "blog/posts/conformal-regression/index.html#recap",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "🔃 Recap",
+ "text": "🔃 Recap\nThis has been a super quick tour of ConformalPrediction.jl. We have seen how the package naturally integrates with MLJ.jl, allowing users to generate rigorous predictive uncertainty estimates for any supervised machine learning model.\n\nAre we done?\nQuite cool, right? Using a single API call we are able to generate rigorous prediction intervals for all kinds of different regression models. Have we just solved predictive uncertainty quantification once and for all? Do we even need to bother with anything else? Conformal Prediction is a very useful tool, but like so many other things, it is not the final answer to all our problems. In fact, let’s see if we can take CP to its limits.\nThe helper function to generate data from above takes an optional argument xmax. By increasing that value, we effectively expand the domain of our input. Let’s do that and see how our conformal model does on this new out-of-domain data.\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n 
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 4: Prediction intervals for our conformalized machine learning model applied to out-of-domain data.\n\n\n\n\n\nWhooooops 🤕 … looks like we’re in trouble: in Figure 4 the prediction intervals do not cover out-of-domain test samples well. What happened here?\n\nBy expanding the domain of our inputs, we have violated the exchangeability assumption. When that assumption is violated, the marginal coverage property does not hold. But do not despair! There are ways to deal with this."
+ },
+ {
+ "objectID": "blog/posts/conformal-regression/index.html#read-on",
+ "href": "blog/posts/conformal-regression/index.html#read-on",
+ "title": "Prediction Intervals for any Regression Model",
+ "section": "📚 Read on",
+ "text": "📚 Read on\nIf you are curious to find out more, be sure to read on in the docs. There are also a number of useful resources to learn more about Conformal Prediction, a few of which I have listed below:\n\nA Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification by Angelopoulos and Bates (2022).\nAwesome Conformal Prediction repository by Manokhin (2022)\nMAPIE: a comprehensive Python library for conformal prediction.\nMy previous two blog posts.\n\nEnjoy!"
+ },
+ {
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html",
+ "title": "Go deep, but also … go Bayesian!",
"section": "",
- "text": "Prediction sets for two different samples and changing coverage rates. As coverage grows, so does the size of the prediction sets.\nA first crucial step towards building trustworthy AI systems is to be transparent about predictive uncertainty. Model parameters are random variables and their values are estimated from noisy data. That inherent stochasticity feeds through to model predictions and should to be addressed, at the very least in order to avoid overconfidence in models.\nBeyond that obvious concern, it turns out that quantifying model uncertainty actually opens up a myriad of possibilities to improve up- and down-stream modeling tasks like active learning and robustness. In Bayesian Active Learning, for example, uncertainty estimates are used to guide the search for new input samples, which can make ground-truthing tasks more efficient (Houlsby et al. 2011). With respect to model performance in downstream tasks, uncertainty quantification can be used to improve model calibration and robustness (Lakshminarayanan, Pritzel, and Blundell 2017).\nIn previous posts we have looked at how uncertainty can be quantified in the Bayesian context (see here and here). Since in Bayesian modeling we are generally concerned with estimating posterior distributions, we get uncertainty estimates almost as a byproduct. This is great for all intends and purposes, but it hinges on assumptions about prior distributions. Personally, I have no quarrel with the idea of making prior distributional assumptions. On the contrary, I think the Bayesian framework formalizes the idea of integrating prior information in models and therefore provides a powerful toolkit for conducting science. Still, in some cases this requirement may be seen as too restrictive or we may simply lack prior information.\nEnter: Conformal Prediction (CP) — a scalable frequentist approach to uncertainty quantification and coverage control. In this post we will go through the basic concepts underlying CP. 
A number of hands-on usage examples in Julia should hopefully help to convey some intuition and ideally attract people interested in contributing to a new and exciting open-source development."
+ "text": "A Bayesian Neural Network gradually learns.\nDeep learning has dominated AI research in recent years - but how much promise does it really hold? That is very much an ongoing and increasingly polarising debate that you can follow live on Twitter. On one side you have optimists like Ilya Sutskever, chief scientist of OpenAI, who believes that large deep neural networks may already be slightly conscious - that’s “may” and “slightly” and only if you just go deep enough? On the other side you have prominent skeptics like Judea Pearl, who has long argued that deep learning still boils down to curve fitting - purely associational and not even remotely intelligent (Pearl and Mackenzie 2018)."
},
{
- "objectID": "blog/posts/conformal-prediction/index.html#sec-background",
- "href": "blog/posts/conformal-prediction/index.html#sec-background",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
- "section": "📖 Background",
- "text": "📖 Background\nConformal Prediction promises to be an easy-to-understand, distribution-free and model-agnostic way to generate statistically rigorous uncertainty estimates. That’s quite a mouthful, so let’s break it down: firstly, as I will hopefully manage to illustrate in this post, the underlying concepts truly are fairly straight-forward to understand; secondly, CP indeed relies on only minimal distributional assumptions; thirdly, common procedures to generate conformal predictions really do apply almost universally to all supervised models, therefore making the framework very intriguing to the ML community; and, finally, CP does in fact come with a frequentist coverage guarantee that ensures that conformal prediction sets contain the true value with a user-chosen probability. For a formal proof of this marginal coverage property and a detailed introduction to the topic, I recommend Angelopoulos and Bates (2022).\n\n\n\n\n\n\nNote\n\n\n\nIn what follows we will loosely treat the tutorial by Angelopoulos and Bates (2022) and the general framework it sets as a reference. You are not expected to have read the paper, but I also won’t reiterate any details here.\n\n\nCP can be used to generate prediction intervals for regression models and prediction sets for classification models (more on this later). There is also some recent work on conformal predictive distributions and probabilistic predictions. Interestingly, it can even be used to complement Bayesian methods. Angelopoulos and Bates (2022), for example, point out that prior information should be incorporated into prediction sets and demonstrate how Bayesian predictive distributions can be conformalized in order to comply with the frequentist notion of coverage. Relatedly, Hoff (2021) proposes a Bayes-optimal prediction procedure. And finally, Stanton, Maddox, and Wilson (2022) very recently proposed a way to introduce conformal prediction in Bayesian Optimization. 
I find this type of work that combines different schools of thought very promising, but I’m drifting off a little … So, without further ado, let us look at some code."
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html#the-case-for-bayesian-deep-learning",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html#the-case-for-bayesian-deep-learning",
+ "title": "Go deep, but also … go Bayesian!",
+ "section": "The case for Bayesian Deep Learning",
+ "text": "The case for Bayesian Deep Learning\nWhatever side of this entertaining Twitter dispute you find yourself on, the reality is that deep-learning systems have already been deployed at large scale both in academia and industry. More pressing debates therefore revolve around the trustworthiness of these existing systems. How robust are they and in what way exactly do they arrive at decisions that affect each and every one of us? Robustifying deep neural networks generally involves some form of adversarial training, which is costly, can hurt generalization (Raghunathan et al. 2019) and ultimately does not guarantee stability (Bastounis, Hansen, and Vlačić 2021). With respect to interpretability, surrogate explainers like LIME and SHAP are among the most popular tools, but they too have been shown to lack robustness (Slack et al. 2020).\nExactly why are deep neural networks unstable and opaque? Let \\(\\mathcal{D}=\\{x_n,y_n\\}_{n=1}^N\\) denote our feature-label pairs and let \\(f(x;\\theta)=y\\) denote some deep neural network specified by its parameters \\(\\theta\\). Then the first thing to note is that the number of free parameters \\(\\theta\\) is typically huge (if you ask Mr Sutskever it really probably cannot be huge enough!). That alone makes it very hard to monitor and interpret the inner workings of deep-learning algorithms. Perhaps more importantly though, the number of parameters relative to the size of \\(\\mathcal{D}\\) is generally huge:\n\n[…] deep neural networks are typically very underspecified by the available data, and […] parameters [therefore] correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)\n\nIn other words, training a single deep neural network may (and usually does) lead to one random parameter specification that fits the underlying data very well. But in all likelihood there are many other specifications that also fit the data very well. 
This is both a strength and a vulnerability of deep learning: it is a strength because it typically allows us to find one such “compelling explanation” for the data with ease through stochastic optimization; it is a vulnerability because one has to wonder:\n\nHow compelling is an explanation really if it competes with many other equally compelling, but potentially very different explanations?\n\nA scenario like this very much calls for treating predictions from deep learning models probabilistically (Wilson 2020).\nFormally, we are interested in estimating the posterior predictive distribution as the following Bayesian model average (BMA):\n\\[\np(y|x,\\mathcal{D}) = \\int p(y|x,\\theta)p(\\theta|\\mathcal{D})d\\theta\n\\]\nThe integral implies that we essentially need many predictions from many different specifications of \\(\\theta\\). Unfortunately, this means more work for us or rather our computers. Fortunately though, researchers have proposed many ingenious ways to approximate the equation above in recent years: Gal and Ghahramani (2016) propose using dropout at test time while Lakshminarayanan, Pritzel, and Blundell (2017) show that averaging over an ensemble of just five models seems to do the trick. Still, despite their simplicity and usefulness these approaches involve additional computational costs compared to training just a single network. As we shall see now though, another promising approach has recently entered the limelight: Laplace approximation (LA).\nIf you have read my previous post on Bayesian Logistic Regression, then the term Laplace should already sound familiar to you. As a matter of fact, we will see that all concepts covered in that previous post can be naturally extended to deep learning. While some of these concepts will be revisited below, I strongly recommend you check out the previous post before reading on here. Without further ado let us now see how LA can be used for truly effortless deep learning."
},
{
- "objectID": "blog/posts/conformal-prediction/index.html#sec-julia",
- "href": "blog/posts/conformal-prediction/index.html#sec-julia",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
- "section": "📦 Conformal Prediction in Julia",
- "text": "📦 Conformal Prediction in Julia\nIn this section of this first short post on CP we will look at how conformal prediction can be implemented in Julia. In particular, we will look at an approach that is compatible with any of the many supervised machine learning models available in MLJ: a beautiful, comprehensive machine learning framework funded by the Alan Turing Institute and the New Zealand Strategic Science Investment Fund Blaom et al. (2020). We will go through some basic usage examples employing a new Julia package that I have been working on: ConformalPrediction.jl.\n\n\n\n\n\n\nConformalPrediction.jl\n\n\n\nConformalPrediction.jl is a package for uncertainty quantification through conformal prediction for machine learning models trained in MLJ. At the time of writing it is still in its early stages of development, but already implements a range of different approaches to CP. Contributions are very much welcome:\n\nDocumentation\nContributor’s Guide\n\n\n\n\nSplit Conformal Classification\nWe consider a simple binary classification problem. Let \\((X_i, Y_i), \\ i=1,...,n\\) denote our feature-label pairs and let \\(\\mu: \\mathcal{X} \\mapsto \\mathcal{Y}\\) denote the mapping from features to labels. For illustration purposes we will use the moons dataset 🌙. 
Using MLJ.jl we first generate the data and split into into a training and test set:\n\n\nCode\nusing MLJ\nusing Random\nRandom.seed!(123)\n\n# Data:\nX, y = make_moons(500; noise=0.15)\ntrain, test = partition(eachindex(y), 0.8, shuffle=true)\n\n\nHere we will use a specific case of CP called split conformal prediction which can then be summarized as follows:1\n\nPartition the training into a proper training set and a separate calibration set: \\(\\mathcal{D}_n=\\mathcal{D}^{\\text{train}} \\cup \\mathcal{D}^{\\text{cali}}\\).\nTrain the machine learning model on the proper training set: \\(\\hat\\mu_{i \\in \\mathcal{D}^{\\text{train}}}(X_i,Y_i)\\).\nCompute nonconformity scores, \\(\\mathcal{S}\\), using the calibration data \\(\\mathcal{D}^{\\text{cali}}\\) and the fitted model \\(\\hat\\mu_{i \\in \\mathcal{D}^{\\text{train}}}\\).\nFor a user-specified desired coverage ratio \\((1-\\alpha)\\) compute the corresponding quantile, \\(\\hat{q}\\), of the empirical distribution of nonconformity scores, \\(\\mathcal{S}\\).\nFor the given quantile and test sample \\(X_{\\text{test}}\\), form the corresponding conformal prediction set:\n\n\\[\nC(X_{\\text{test}})=\\{y:s(X_{\\text{test}},y) \\le \\hat{q}\\}\n\\tag{1}\\]\nThis is the default procedure used for classification and regression in ConformalPrediction.jl.\nYou may want to take a look at the source code for the classification case here. As a first important step, we begin by defining a concrete type SimpleInductiveClassifier that wraps a supervised model from MLJ.jl and reserves additional fields for a few hyperparameters. As a second step, we define the training procedure, which includes the data-splitting and calibration step. Finally, as a third step we implement the procedure in Equation 1 to compute the conformal prediction set.\n\n\n\n\n\n\nDevelopment Status\n\n\n\nThe permalinks above take you to the version of the package that was up-to-date at the time of writing. 
Since the package is in its early stages of development, the code base and API can be expected to change.\n\n\nNow let’s take this to our 🌙 data. To illustrate the package functionality we will demonstrate the envisioned workflow. We first define our atomic machine learning model following standard MLJ.jl conventions. Using ConformalPrediction.jl we then wrap our atomic model in a conformal model using the standard API call conformal_model(model::Supervised; kwargs...). To train and predict from our conformal model we can then rely on the conventional MLJ.jl procedure again. In particular, we wrap our conformal model in data (turning it into a machine) and then fit it on the training set. Finally, we use our machine to predict the label for a new test sample Xtest:\n\n\nCode\n# Model:\nKNNClassifier = @load KNNClassifier pkg=NearestNeighborModels\nmodel = KNNClassifier(;K=50) \n\n# Training:\nusing ConformalPrediction\nconf_model = conformal_model(model; coverage=.9)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\nXtest = selectrows(X, first(test))\nytest = y[first(test)]\npredict(mach, Xtest)[1]\n\n\nimport NearestNeighborModels\n\n\n ✔\n\n\nUnivariateFinite{Multiclass{2}}(0=>0.94)\n\n\nThe final predictions are set-valued. While the softmax output remains unchanged for the SimpleInductiveClassifier, the size of the prediction set depends on the chosen coverage rate, \\((1-\\alpha)\\).\n\n\nWhen specifying a coverage rate very close to one, the prediction set will typically include many (in some cases all) of the possible labels. Below, for example, both classes are included in the prediction set when setting the coverage rate equal to \\((1-\\alpha)\\)=1.0. 
This is intuitive, since high coverage quite literally requires that the true label is covered by the prediction set with high probability.\n\n\n\n\nCode\nconf_model = conformal_model(model; coverage=coverage)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\nXtest = (x1=[1],x2=[0])\npredict(mach, Xtest)[1]\n\n\nUnivariateFinite{Multiclass{2}}(0=>0.5, 1=>0.5)\n\n\n\n\nConversely, for low coverage rates, prediction sets can also be empty. For a choice of \\((1-\\alpha)\\)=0.1, for example, the prediction set for our test sample is empty. This is a bit difficult to think about intuitively and I have not yet come across a satisfactory, intuitive interpretation.2 When the prediction set is empty, the predict call currently returns missing:\n\n\n\n\nCode\nconf_model = conformal_model(model; coverage=coverage)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\npredict(mach, Xtest)[1]\n\n\nmissing\n\n\nFigure 1 should provide some more intuition as to what exactly is happening here. It illustrates the effect of the chosen coverage rate on the predicted softmax output and the set size in the two-dimensional feature space. Contours are overlayed with the moon data points (including test data). The two samples highlighted in red, \\(X_1\\) and \\(X_2\\), have been manually added for illustration purposes. Let’s look at these one by one.\nFirstly, note that \\(X_1\\) (red cross) falls into a region of the domain that is characterized by high predictive uncertainty. It sits right at the bottom-right corner of our class-zero moon 🌜 (orange), a region that is almost entirely enveloped by our class-one moon 🌛 (green). For low coverage rates the prediction set for \\(X_1\\) is empty: on the left-hand side this is indicated by the missing contour for the softmax probability; on the right-hand side we can observe that the corresponding set size is indeed zero. 
For high coverage rates the prediction set includes both \\(y=0\\) and \\(y=1\\), indicative of the fact that the conformal classifier is uncertain about the true label.\nWith respect to \\(X_2\\), we observe that while also sitting on the fringe of our class-zero moon, this sample populates a region that is not fully enveloped by data points from the opposite class. In this region, the underlying atomic classifier can be expected to be more certain about its predictions, but still not highly confident. How is this reflected by our corresponding conformal prediction sets?\n\n\nCode\nXtest_2 = (x1=[-0.5],x2=[0.25])\ncov_ = .9\nconf_model = conformal_model(model; coverage=cov_)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\np̂_2 = pdf(predict(mach, Xtest_2)[1], 0)\n\n\n\n\nWell, for low coverage rates (roughly \\(<0.9\\)) the conformal prediction set does not include \\(y=0\\): the set size is zero (right panel). Only for higher coverage rates do we have \\(C(X_2)=\\{0\\}\\): the coverage rate is high enough to include \\(y=0\\), but the corresponding softmax probability is still fairly low. For example, for \\((1-\\alpha)=0.9\\) we have \\(\\hat{p}(y=0|X_2)=0.72.\\)\n\n\nThese two examples illustrate an interesting point: for regions characterised by high predictive uncertainty, conformal prediction sets are typically empty (for low coverage) or large (for high coverage). 
While set-valued predictions may be something to get used to, this notion is overall intuitive.\n\n\nCode\n# Setup\ncoverages = range(0.75,1.0,length=5)\nn = 100\nx1_range = range(extrema(X.x1)...,length=n)\nx2_range = range(extrema(X.x2)...,length=n)\n\nanim = @animate for coverage in coverages\n conf_model = conformal_model(model; coverage=coverage)\n mach = machine(conf_model, X, y)\n fit!(mach, rows=train)\n p1 = contourf_cp(mach, x1_range, x2_range; type=:proba, title=\"Softmax\", axis=nothing)\n scatter!(p1, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)\n scatter!(p1, Xtest.x1, Xtest.x2, ms=6, c=:red, label=\"X₁\", shape=:cross, msw=6)\n scatter!(p1, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label=\"X₂\", shape=:diamond, msw=6)\n p2 = contourf_cp(mach, x1_range, x2_range; type=:set_size, title=\"Set size\", axis=nothing)\n scatter!(p2, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)\n scatter!(p2, Xtest.x1, Xtest.x2, ms=6, c=:red, label=\"X₁\", shape=:cross, msw=6)\n scatter!(p2, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label=\"X₂\", shape=:diamond, msw=6)\n plot(p1, p2, plot_title=\"(1-α)=$(round(coverage,digits=2))\", size=(800,300))\nend\n\ngif(anim, fps=0.5)\n\n\n\n\n\n\n\nFigure 1: The effect of the coverage rate on the conformal prediction set. Softmax probabilities are shown on the left. The size of the prediction set is shown on the right."
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html#laplace-approximation",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html#laplace-approximation",
+ "title": "Go deep, but also … go Bayesian!",
+ "section": "Laplace Approximation",
+ "text": "Laplace Approximation\nWhile LA was first proposed in the 18th century, it has so far not attracted serious attention from the deep learning community largely because it involves a possibly large Hessian computation. Daxberger et al. (2021) are on a mission to change the perception that LA has no use in DL: in their NeurIPS 2021 paper they demonstrate empirically that LA can be used to produce Bayesian model averages that are at least on par with existing approaches in terms of uncertainty quantification and out-of-distribution detection and significantly cheaper to compute. They show that recent advancements in automatic differentiation can be leveraged to produce fast and accurate approximations of the Hessian and even provide a fully-fledged Python library that can be used with any pretrained Torch model. For this post, I have built a much less comprehensive, pure-play equivalent of their package in Julia - LaplaceRedux.jl can be used with deep learning models built in Flux.jl, which is Julia’s main DL library. As in the previous post on Bayesian logistic regression I will rely on Julia code snippets instead of equations to convey the underlying maths. If you’re curious about the maths, the NeurIPS 2021 paper provides all the detail you need.\n\nFrom Bayesian Logistic Regression …\nLet’s recap: in the case of logistic regression we had assumed a zero-mean Gaussian prior \\(p(\\mathbf{w}) \\sim \\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\sigma_0^2 \\mathbf{I} \\right)=\\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\mathbf{H}_0^{-1} \\right)\\) for the weights that are used to compute logits \\(\\mu_n=\\mathbf{w}^T\\mathbf{x}_n\\), which in turn are fed to a sigmoid function to produce probabilities \\(p(y_n=1)=\\sigma(\\mu_n)\\). 
We saw that under this assumption solving the logistic regression problem corresponds to minimizing the following differentiable loss function:\n\\[\n\\ell(\\mathbf{w})= - \\sum_{n=1}^N [y_n \\log \\sigma(\\mu_n) + (1-y_n)\\log (1-\\sigma(\\mu_n))] + \\frac{1}{2} \\mathbf{w}^T\\mathbf{H}_0\\mathbf{w}\n\\]\nAs our first step towards Bayesian deep learning, we observe the following: the loss function above corresponds to the objective faced by a single-layer artificial neural network with sigmoid activation and weight decay. In other words, regularized logistic regression is equivalent to a very simple neural network architecture and hence it is not surprising that underlying concepts can in theory be applied in much the same way.\nSo let’s quickly recap the next core concept: LA relies on the fact that the second-order Taylor expansion of our loss function \\(\\ell\\) evaluated at the maximum a posteriori (MAP) estimate \\(\\mathbf{\\hat{w}}=\\arg\\max_{\\mathbf{w}} p(\\mathbf{w}|\\mathcal{D})\\) amounts to a multi-variate Gaussian distribution. In particular, that Gaussian is centered around the MAP estimate with covariance equal to the inverse Hessian evaluated at the mode \\(\\hat{\\Sigma}=(\\mathbf{H}(\\mathbf{\\hat{w}}))^{-1}\\) (Murphy 2022).\nThat is basically all there is to the story: if we have a good estimate of \\(\\mathbf{H}(\\mathbf{\\hat{w}})\\) we have an analytical expression for an (approximate) posterior over parameters. So let’s go ahead and start by running Bayesian logistic regression using Flux.jl. We begin by loading some required packages including LaplaceRedux.jl. 
It ships with a helper function toy_data_linear that creates a toy data set composed of linearly separable samples evenly balanced across the two classes.\n\n\nCode\n# Import libraries.\nusing Flux, Plots, Random, PlotThemes, Statistics, LaplaceRedux\ntheme(:wong)\n# Number of points to generate.\nxs, y = toy_data_linear(100)\nX = hcat(xs...); # bring into tabular format\ndata = zip(xs,y);\n\n\nThen we proceed to prepare the single-layer neural network with weight decay. The term \\(\\lambda\\) determines the strength of the \\(\\ell_2\\) penalty: we regularize parameters \\(\\theta\\) more heavily for higher values. Equivalently, we can say that from the Bayesian perspective it governs the strength of the prior \\(p(\\theta) \\sim \\mathcal{N} \\left( \\theta | \\mathbf{0}, \\sigma_0^2 \\mathbf{I} \\right)= \\mathcal{N} \\left( \\theta | \\mathbf{0}, \\lambda^{-2} \\mathbf{I} \\right)\\): a higher value of \\(\\lambda\\) indicates a higher conviction about our prior belief that \\(\\theta=\\mathbf{0}\\), which is of course equivalent to regularizing more heavily. The exact choice of \\(\\lambda=0.5\\) for this toy example is somewhat arbitrary (it made for good visualizations below). 
Note that I have used \\(\\theta\\) to denote our neural parameters to distinguish the case from Bayesian logistic regression, but we are in fact still solving the same problem.\n\n\nCode\nnn = Chain(Dense(2,1))\nλ = 0.5\nsqnorm(x) = sum(abs2, x)\nweight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))\nloss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization();\n\n\nBefore we apply the Laplace approximation, we train our model:\n\n\nCode\nusing Flux.Optimise: update!, ADAM\nopt = ADAM()\nepochs = 50\n\nfor epoch = 1:epochs\n for d in data\n gs = gradient(params(nn)) do\n l = loss(d...)\n end\n update!(opt, params(nn), gs)\n end\nend\n\n\nUp until this point we have just followed the standard recipe for training a regularized artificial neural network in Flux.jl for a simple binary classification task. To compute the Laplace approximation using LaplaceRedux.jl we need just two more lines of code:\n\n\nCode\nla = laplace(nn, λ=λ)\nfit!(la, data);\n\n\nUnder the hood the Hessian is approximated through the empirical Fisher, which can be computed using only the gradients of our loss function \\(\\nabla_{\\theta}\\ell(f(\\mathbf{x}_n;\\theta),y_n)\\) where \\(\\{\\mathbf{x}_n,y_n\\}\\) are training data (see NeurIPS 2021 paper for details). Finally, LaplaceRedux.jl ships with a function predict(𝑳::LaplaceRedux, X::AbstractArray; link_approx=:probit) that computes the posterior predictive using a probit approximation, much like we saw in the previous post. That function is used under the hood of the plot_contour function below to create the right panel of Figure 1. It visualizes the posterior predictive distribution in the 2D feature space. For comparison I have added the corresponding plugin estimate as well. 
Note how for the Laplace approximation the predicted probabilities fan out, indicating that confidence decreases in regions where data is scarce.\n\n\nCode\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin);\np_laplace = plot_contour(X',y,la;title=\"Laplace\")\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))\nsavefig(plt, \"www/posterior_predictive_logit.png\");\n\n\n\n\n\n\n\n\nFigure 1: Posterior predictive distribution of Logistic regression in the 2D feature space using plugin estimator (left) and Laplace approximation (right).\n\n\n\n\n\n… to Bayesian Neural Networks\nNow let’s step it up a notch: we will repeat the exercise from above, but this time for data that is not linearly separable using a simple MLP instead of the single-layer neural network we used above. The code below is almost the same as above, so I will not go through the various steps again.\n\n\nCode\n# Number of points to generate:\nxs, y = toy_data_non_linear(200)\nX = hcat(xs...); # bring into tabular format\ndata = zip(xs,y)\n\n# Build MLP:\nn_hidden = 32\nD = size(X)[1]\nnn = Chain(\n Dense(D, n_hidden, σ),\n Dense(n_hidden, 1)\n) \nλ = 0.01\nsqnorm(x) = sum(abs2, x)\nweight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))\nloss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization()\n\n# Training:\nepochs = 200\nfor epoch = 1:epochs\n for d in data\n gs = gradient(params(nn)) do\n l = loss(d...)\n end\n update!(opt, params(nn), gs)\n end\nend\n\n\nFitting the Laplace approximation is also analogous, but note that this time we have added an argument: subset_of_weights=:last_layer. This specifies that we only want to use the parameters of the last layer of our MLP. While we could have used all of them (subset_of_weights=:all), Daxberger et al. (2021) find that the last-layer Laplace approximation produces satisfying results, while being computationally cheaper. 
Figure 2 demonstrates that once again the Laplace approximation yields a posterior predictive distribution that is more conservative than the over-confident plugin estimate.\n\n\nCode\nla = laplace(nn, λ=λ, subset_of_weights=:last_layer)\nfit!(la, data);\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin)\np_laplace = plot_contour(X',y,la;title=\"Laplace\")\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))\nsavefig(plt, \"www/posterior_predictive_mlp.png\");\n\n\n\n\n\n\n\n\nFigure 2: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right).\n\n\n\nTo see why this is a desirable outcome consider the zoomed out version of Figure 2 below: the plugin estimator classifies with full confidence in regions completely devoid of data. Arguably Laplace approximation produces a much more reasonable picture, even though it too could likely be improved by fine-tuning our choice of \\(\\lambda\\) and the neural network architecture.\n\n\nCode\nzoom=-50\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin,zoom=zoom);\np_laplace = plot_contour(X',y,la;title=\"Laplace\",zoom=zoom);\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400));\nsavefig(plt, \"www/posterior_predictive_mlp_zoom.png\");\n\n\n\n\n\n\n\n\nFigure 3: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right). Zoomed out."
},
{
- "objectID": "blog/posts/conformal-prediction/index.html#conclusion",
- "href": "blog/posts/conformal-prediction/index.html#conclusion",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
- "section": "🏁 Conclusion",
- "text": "🏁 Conclusion\nThis has really been a whistle-stop tour of Conformal Prediction: an active area of research that probably deserves much more attention. Hopefully, though, this post has helped to provide some color and, if anything, made you more curious about the topic. Let’s recap the TL;DR from above:\n\nConformal Prediction is an interesting frequentist approach to uncertainty quantification that can even be combined with Bayes (Section 1).\nIt is scalable and model-agnostic and therefore well applicable to machine learning (Section 1).\nConformalPrediction.jl implements CP in pure Julia and can be used with any supervised model available from MLJ.jl (Section 2).\nImplementing CP directly on top of an existing, powerful machine learning toolkit demonstrates the potential usefulness of this framework to the ML community (Section 2).\nStandard conformal classifiers produce set-valued predictions: for ambiguous samples these sets are typically large (for high coverage) or empty (for low coverage) (Section 2.1).\n\nBelow I will leave you with some further resources."
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html#wrapping-up",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html#wrapping-up",
+ "title": "Go deep, but also … go Bayesian!",
+ "section": "Wrapping up",
+ "text": "Wrapping up\nRecent state-of-the-art research on neural information processing suggests that Bayesian deep learning can be effortless: Laplace approximation for deep neural networks appears to work very well and it does so at minimal computational cost (Daxberger et al. 2021). This is great news, because the case for turning Bayesian is strong: society increasingly relies on complex automated decision-making systems that need to be trustworthy. More and more of these systems involve deep learning which in and of itself is not trustworthy. We have seen that typically there exist various viable parameterizations of deep neural networks each with their own distinct and compelling explanation for the data at hand. When faced with many viable options, don’t put all of your eggs in one basket. In other words, go Bayesian!"
},
{
- "objectID": "blog/posts/conformal-prediction/index.html#further-resources",
- "href": "blog/posts/conformal-prediction/index.html#further-resources",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
- "section": "📚 Further Resources",
- "text": "📚 Further Resources\nChances are that you have already come across the Awesome Conformal Prediction repo: Manokhin (2022) provides a comprehensive, up-to-date overview of resources related to the conformal prediction. Among the listed articles you will also find Angelopoulos and Bates (2022), which inspired much of this post. The repo also points to open-source implementations in other popular programming languages including Python and R."
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html#resources",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html#resources",
+ "title": "Go deep, but also … go Bayesian!",
+ "section": "Resources",
+ "text": "Resources\nTo get started with Bayesian deep learning I have found many useful and free resources online, some of which are listed below:\n\nTuring.jl tutorial on Bayesian deep learning in Julia.\nVarious RStudio AI blog posts including this one and this one.\nTensorFlow blog post on regression with probabilistic layers.\nKevin Murphy’s draft textbook, now also available in print."
},
{
- "objectID": "blog/posts/conformal-prediction/index.html#footnotes",
- "href": "blog/posts/conformal-prediction/index.html#footnotes",
- "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "objectID": "blog/posts/effortsless-bayesian-dl/index.html#footnotes",
+ "href": "blog/posts/effortsless-bayesian-dl/index.html#footnotes",
+ "title": "Go deep, but also … go Bayesian!",
"section": "Footnotes",
- "text": "Footnotes\n\n\nIn other places split conformal prediction is sometimes referred to as inductive conformal prediction.↩︎\nAny thoughts/comments welcome!↩︎"
+ "text": "Footnotes\n\n\nSee for example this article in the MIT Technology Review↩︎\nIn fact, not treating probabilistic deep learning models as such is sheer madness because remember that the underlying parameters \\(\\theta\\) are random variables. Frequentists and Bayesians alike will tell you that relying on a single point estimate of random variables is just nuts!↩︎\nProponents of Causal AI like Judea Pearl would argue that the Bayesian treatment still does not go far enough: in their view model explanations can only be truly compelling if they are causally founded.↩︎\nSee this answer on Stack Exchange for a detailed discussion.↩︎"
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html",
- "href": "blog/posts/guest-students-laplace/index.html",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html",
+ "title": "A new tool for explainable AI",
"section": "",
- "text": "Guest Blog Post\n\n\n\nThis blog post was originally written by Severin Bratus and colleagues from TU Delft and published on Medium. This version of the post includes only minor edits. If you would like to contribute a guest blog post, please get in touch.\nThis post summarizes a quarter-long second-year BSc coursework project at TU Delft. Our team of five students has made multiple improvements to LaplaceRedux.jl, due to Patrick Altmeyer. Inspired by its Pythonic counterpart, laplacet-torch, this Julia library aims to provide low-overhead Bayesian uncertainty calibration to deep neural networks via Laplace Approximations (Daxberger et al. 2021).\nWe will begin by demystifying the technical terms in the last sentence, in order to explain our contributions to the library and highlight some impressions from the experience. Note that our team has begun working on this PhD-tier subject only having had some introductory courses on probability and statistics, machine learning, and computational intelligence, without any prior exposure to Julia."
+ "text": "Turning a 9 (nine) into a 4 (four).\nCounterfactual explanations, which I introduced in one of my previous posts1, offer a simple and intuitive way to explain black-box models without opening them. Still, as of today there exists only one open-source library that provides a unifying approach to generate and benchmark counterfactual explanations for models built and trained in Python (Pawelczyk et al. 2021). This is great, but of limited use to users of other programming languages 🥲.\nEnter CounterfactualExplanations.jl: a Julia package that can be used to explain machine learning algorithms developed and trained in Julia, Python and R. Counterfactual explanations fall into the broader category of explainable artificial intelligence (XAI).\nExplainable AI typically involves models that are not inherently interpretable but require additional tools to be explainable to humans. Examples of such models include ensembles, support vector machines and deep neural networks. This is not to be confused with interpretable AI, which involves models that are inherently interpretable and transparent such as generalized additive models (GAM), decision trees and rule-based models.\nSome would argue that we best avoid explaining black-box models altogether (Rudin 2019) and instead focus solely on interpretable AI. While I agree that initial efforts should always be geared towards interpretable models, stopping there would entail missed opportunities and anyway is probably not very realistic in times of DALL\\(\\cdot\\)E and Co.\nThis post introduces the main functionality of the new Julia package. Following a motivating example using a model trained in Julia, we will see how easily the package can be adapted to work with models trained in Python and R. Since the motivation for this post is also to hopefully attract contributors, the final section outlines some of the exciting developments we have planned."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#bayesian-learning",
- "href": "blog/posts/guest-students-laplace/index.html#bayesian-learning",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Bayesian Learning",
- "text": "Bayesian Learning\nUncertainty calibration remains a crucial issue in safety-critical applications of modern AI, as, for instance, in autonomous driving. You would want your car autopilot not only to make accurate predictions but also to indicate when a model prediction is uncertain, to give control back to the human driver.\nA model is well-calibrated if the confidence of a prediction matches its true error rate. Note that you can have well-fit models that are badly calibrated, and vice versa (just like in life, you meet smart people, yet annoyingly arrogant).\nThe standard deep learning training process of gradient descent converges at a weight configuration that minimizes the loss function. The model obtained may be great, yet it is only a point estimate of what the weight parameters should look like.\nHowever, with the sheer immensity of the weight space, neural networks are probably underspecified by the data (or, overfit). As neural networks can approximate highly complex functions, many weight configurations would yield roughly the same training loss, yet with varying abilities to generalize outside the training dataset. This is why there are so many regularization methods out there, to keep the models simpler. One radical, yet effective approach is described by LeCun, Denker, and Solla (1989):\n\n… it is possible to take a perfectly reasonable network, delete half (or more) of the weights and wind up with a network that works just as well, or better.\n\n\n\n\n\n\n\nFigure 1: The loss landscape. One can imagine gradient descent as a particle, let’s say a ball, or a grain of sand, rolling to the bottom of a pit. Then for Bayesian Learning, we have as if a pile of sand poured around at that bottom point, with the pile being thicker where loss is lower. This proverbial sand pile would represent the posterior parameter distribution. Figure due to Amini et al. 
(2019)\n\n\n\nThe way gradient is usually illustrated is with a picture like the one shown in Figure 1 above a curved terrain of the loss function across the parameter space. Each point of the horizontal plane corresponds to some configuration of parameters. Gradient descent seeks the point at the bottom of this terrain, as the point with the lowest loss, however as the loss-curvature is highly non-convex and high-dimensional there are many directions in which we could move and still maintain a low loss. Thus instead of a singular point we would like to specify a probability distribution around that optimal point. Bayesian methods, and in particular Laplace Approximations, allow us to do this!\nFirstly, the Bayesian approach to neural network uncertainty calibration is that of modelling the posterior using Bayes’ Theorem:\n\\[\np(\\theta \\mid \\mathcal{D}) = \\tfrac{1}{Z} \\,p(\\mathcal{D} \\mid \\theta) \\, p(\\theta), \\qquad Z:= p(\\mathcal{D}) = \\textstyle\\int p(\\mathcal{D} \\mid \\theta) \\, p(\\theta) \\,d\\theta\n\\]\nHere \\(p(\\mathcal{D} \\mid \\theta)\\) is the likelihood of the data given by the parameters \\(\\theta\\). The prior distribution \\(p(\\theta)\\) specifies our beliefs about what the model parameters would be prior to observing the data. Finally, the intractable constant \\(Z\\) is called the evidence: it characterizes the probability of observing \\(\\mathcal{D}\\) as a whole, across all possible parameter settings (see here for details).\nFor models returning a probability distribution (e.g. classifiers), the loss is commonly defined as the negative log-likelihood. Thus if gradient descent minimizes loss, it maximizes the likelihood, producing the maximum likelihood estimate (MLE), which (assuming a uniform prior) also maximizes the posterior. This is why we call this point the maximum a posteriori, or the MAP. 
It makes sense to model this point as the mode of the posterior distribution, which could, for example, be a normal Gaussian distribution (see also the introductory post on this blog)."
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#counterfactuals-for-image-data",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#counterfactuals-for-image-data",
+ "title": "A new tool for explainable AI",
+ "section": "Counterfactuals for image data 🖼",
+ "text": "Counterfactuals for image data 🖼\nTo introduce counterfactual explanations I used a simple binary classification problem in my previous post. It involved a linear classifier and a linearly separable, synthetic data set with just two features. This time we are going to step it up a notch: we will generate counterfactual explanations for MNIST data. The MNIST dataset contains 60,000 training samples of handwritten digits in the form of 28x28 pixel grey-scale images (LeCun 1998). Each image is associated with a label indicating the digit (0-9) that the image represents.\nThe CounterfactualExplanations.jl package ships with two black-box models that were trained to predict labels for this data: firstly, a simple multi-layer perceptron (MLP) and, secondly, a corresponding deep ensemble. Originally proposed by Lakshminarayanan, Pritzel, and Blundell (2017), deep ensembles are really just ensembles of deep neural networks. They are still among the most popular approaches to Bayesian deep learning.2\n\nBlack-box models\nThe code below loads relevant packages along with the MNIST data and pre-trained models.\n\n\nCode\n# Load package, models and data:\nusing CounterfactualExplanations, Flux\nusing CounterfactualExplanations.Data: mnist_data, mnist_model, mnist_ensemble\ndata, X, ys = mnist_data()\nmodel = mnist_model()\nensemble = mnist_ensemble()\ncounterfactual_data = CounterfactualData(X,ys;domain=(0,1))\n\n\nWhile the package can currently handle a few simple classification models natively, it is designed to be easily extensible by users and contributors. 
Extending the package to deal with custom models typically involves only two simple steps:\n\nSubtyping: the custom model needs to be declared as a subtype of the package-internal type AbstractFittedModel.\nMultiple dispatch: the package-internal functions logits and probs need to be extended through custom methods for the new model type.\n\nThe following code implements these two steps first for the MLP and then for the deep ensemble.\n\n\nCode\nusing CounterfactualExplanations.Models\nimport CounterfactualExplanations.Models: logits, probs\n# MLP:\n# Step 1)\nstruct NeuralNetwork <: Models.AbstractFittedModel\n model::Any\nend\n# Step 2)\nlogits(M::NeuralNetwork, X::AbstractArray) = M.model(X)\nprobs(M::NeuralNetwork, X::AbstractArray)= softmax(logits(M, X))\nM = NeuralNetwork(model)\n\n# Deep ensemble:\nusing Flux: stack\n# Step 1)\nstruct FittedEnsemble <: Models.AbstractFittedModel\n ensemble::AbstractArray\nend\n# Step 2)\nusing Statistics\nlogits(M::FittedEnsemble, X::AbstractArray) = mean(stack([m(X) for m in M.ensemble],3),dims=3)\nprobs(M::FittedEnsemble, X::AbstractArray) = mean(stack([softmax(m(X)) for m in M.ensemble],3),dims=3)\nM_ensemble = FittedEnsemble(ensemble)\n\n\n\n\nCounterfactual generators\nNext, we need to specify the counterfactual generators we want to use. The package currently ships with two default generators that both need gradient access: firstly, the generic generator introduced by Wachter, Mittelstadt, and Russell (2017) and, secondly, a greedy generator introduced by Schut et al. (2021).\nThe greedy generator is designed to be used with models that incorporate uncertainty in their predictions such as the deep ensemble introduced above. It works for probabilistic (Bayesian) models, because they only produce high-confidence predictions in regions of the feature domain that are populated by training samples. 
As long as the model is expressive enough and well-specified, counterfactuals in these regions will always be realistic and unambiguous since by construction they should look very similar to training samples. Other popular approaches to counterfactual explanations like REVISE (Joshi et al. 2019) and CLUE (Antorán et al. 2020) also play with this simple idea.\nThe following code instantiates the two generators for the problem at hand.\n\n\nCode\ngeneric = GenericGenerator(;loss=:logitcrossentropy)\ngreedy = GreedyGenerator(;loss=:logitcrossentropy)\n\n\n\n\nExplanations\nOnce the model and counterfactual generator are specified, running counterfactual search is very easy using the package. For a given factual (x), target class (target) and data set (counterfactual_data), simply running\n\ngenerate_counterfactual(x, target, counterfactual_data, M, generic)\n\nwill generate the results, in this case using the generic generator (generic) for the MLP (M). Since we have specified two different black-box models and two different counterfactual generators, we have four combinations of a model and a generator in total. For each of these combinations I have used the generate_counterfactual function to produce the results in Figure 1.\nIn every case the desired label switch is in fact achieved, but arguably from a human perspective only the counterfactuals for the deep ensemble look like a four. The generic generator produces mild perturbations in regions that seem irrelevant from a human perspective, but nonetheless yields a counterfactual that can pass as a four. The greedy approach clearly targets pixels at the top of the handwritten nine and yields the best result overall. For the non-Bayesian MLP, both the generic and the greedy approach generate counterfactuals that look much like adversarial examples: they perturb pixels in seemingly random regions on the image.\n\n\n\n\n\n\nFigure 1: Counterfactual explanations for MNIST: turning a nine (9) into a four (4)."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#laplace-approximations",
- "href": "blog/posts/guest-students-laplace/index.html#laplace-approximations",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Laplace Approximations",
- "text": "Laplace Approximations\nWe do this by a simple-yet-smart trick introduced back in the late 18th century by Pierre-Simon Laplace, the self-proclaimed “greatest French mathematician of his time”. In general, the Laplace Approximation (LA) aims to find a Gaussian approximation to a probability density (in our case, the posterior) defined over a set of continuous variables (in our case, the weights) (Bishop 2006). We can then estimate the loss (negative log-likelihood) as its second-order Taylor expansion:\n\\[\n\\mathcal{L}(\\mathcal{D}; \\theta) \\approx \\mathcal{L}(\\mathcal{D}; \\theta_\\text{MAP}) + \\tfrac{1}{2} (\\theta - \\theta_\\text{MAP})^\\intercal \\left( \\nabla^2 _\\theta \\mathcal{L}(\\mathcal{D}; \\theta) \\vert_{\\theta_\\text{MAP}} \\right)(\\theta - \\theta_\\text{MAP})\n\\]\nNote that the first-order Taylor term vanishes at the MAP since it contains the gradient, and the gradient is zero at MAP, since MAP is a maximum, by definition. What remains is the constant (zeroth-order) term, and the second-order term, containing the Hessian, which is a matrix of partial second-order derivatives.\nThen from this approximation, we can derive the long-sought multivariate normal distribution with the MAP as the mean, and the inverted Hessian as the covariance:\n\\[\np(\\theta \\mid \\mathcal{D}) \\approx N(\\theta; \\theta_\\text{MAP}, \\varSigma) \\qquad\\text{with}\\qquad \\varSigma := \\left( \\nabla^2_\\theta \\mathcal{L}(\\mathcal{D};\\theta) \\vert_{\\theta_\\text{MAP}} \\right)^{-1}\n\\]\nThe evidence \\(Z\\) is now also tractably approximated in closed form, allowing us to apply the Bayes’ theorem, to obtain the posterior distribution \\(p(\\theta \\mid \\mathcal{D})\\). 
We can then express the posterior predictive distribution, for an input \\(x_*\\), prediction \\(f(x_*)\\), to obtain the probability for an output \\(y\\).\nThe evidence \\(Z\\) is now also tractably approximated in closed form, allowing us to apply the Bayes’ theorem, to obtain the posterior distribution \\(p(\\theta \\mid \\mathcal{D})\\). We can then express the posterior predictive distribution, to obtain the probability for an output \\(y\\), given a prediction \\(f(x_*)\\) for an input \\(x_*\\).\n\\[\np(y \\mid f(x_*), \\mathcal{D}) = \\int p(y \\mid f_\\theta(x_*)) \\, p(\\theta \\mid \\mathcal{D}) \\,d\\theta\n\\]\nThis is what we are really after, after all — instead of giving one singular point-estimate prediction \\(\\widehat{y} = f(x_*)\\), we make the neural network give a distribution over \\(y\\).\nHowever, since the Hessian, a square matrix, defines the covariance between all model parameters (upon inversion), of which there may be millions or billions, the computation and storage of the Hessian (not to speak of inversion!) become intractable, as its size scales quadratically with the number of parameters involved. Thus to apply Laplace approximations to large models, we must make some simplifications — which brings us to…"
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#language-interoperability",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#language-interoperability",
+ "title": "A new tool for explainable AI",
+ "section": "Language interoperability 👥",
+ "text": "Language interoperability 👥\nThe Julia language offers unique support for programming language interoperability. For example, calling R or Python is made remarkably easy through RCall.jl and PyCall.jl, respectively. This functionality can be leveraged to use CounterfactualExplanations.jl to generate explanations for models that were developed in other programming languages. At this time there is no native support for foreign programming languages, but the following example involving a torch neural network trained in R demonstrates how versatile the package is.3\n\nExplaining a torch model\nWe will consider a simple MLP trained for a binary classification task. As before we first need to adapt this custom model for use with our package. The code below implements the two necessary steps: sub-typing and method extension. Logits are returned by the torch model and copied from the R environment into the Julia scope. Probabilities are then computed inside the Julia scope by passing the logits through the sigmoid function.\n\n\nCode\nusing Flux\nusing RCall # provides rcopy and the R string macro used below\nusing CounterfactualExplanations, CounterfactualExplanations.Models\nimport CounterfactualExplanations.Models: logits, probs # import functions in order to extend\n\n# Step 1)\nstruct TorchNetwork <: Models.AbstractFittedModel\n nn::Any\nend\n\n# Step 2)\nfunction logits(M::TorchNetwork, X::AbstractArray)\n nn = M.nn\n y = rcopy(R\"as_array($nn(torch_tensor(t($X))))\")\n y = isa(y, AbstractArray) ? y : [y]\n return y'\nend\nfunction probs(M::TorchNetwork, X::AbstractArray)\n return σ.(logits(M, X))\nend\nM = TorchNetwork(R\"model\")\n\n\nCompared to models trained in Julia, we need to do a little more work at this point. Since our counterfactual generators need gradient access, we essentially need to allow our package to communicate with the R torch library. 
While this may sound daunting, it turns out to be quite manageable: all we have to do is respecify the function that computes the gradient with respect to the counterfactual loss function so that it can deal with the TorchNetwork type we defined above. That is all the adjustment needed to use CounterfactualExplanations.jl for our custom R model. Figure 2 shows a counterfactual path for a randomly chosen sample with respect to the MLP trained in R.\n\n\n\n\n\n\nExperimental functionality\n\n\n\nYou may have stumbled across the term respecify above: does it really seem like a good idea to just replace an existing function from our package? Surely not! There are certainly better ways to go about this, which we will consider when adding native support for Python and R models in future package releases. Which brings us to our final section …\n\n\n\n\nCode\nimport CounterfactualExplanations.Generators: ∂ℓ\nusing LinearAlgebra\n\n# Counterfactual loss:\nfunction ∂ℓ(\n generator::AbstractGradientBasedGenerator, \n counterfactual_state::CounterfactualState) \n M = counterfactual_state.M\n nn = M.nn\n x′ = counterfactual_state.x′\n t = counterfactual_state.target_encoded\n R\"\"\"\n x <- torch_tensor($x′, requires_grad=TRUE)\n output <- $nn(x)\n loss_fun <- nnf_binary_cross_entropy_with_logits\n obj_loss <- loss_fun(output,$t)\n obj_loss$backward()\n \"\"\"\n grad = rcopy(R\"as_array(x$grad)\")\n return grad\nend\n\n\n\n\n\n\n\n\nFigure 2: Counterfactual path using the generic counterfactual generator for a model trained in R."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#hessian-approximations",
- "href": "blog/posts/guest-students-laplace/index.html#hessian-approximations",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Hessian approximations",
- "text": "Hessian approximations\nMultiple techniques to approximate the Hessian have arisen from a field adjacent, yet distinct from Bayesian learning — that of second-order optimization, where Hessians are used to accelerate gradient descent convergence.\nOne such approximation is the Fisher information matrix, or simply the Fisher:\n\\[\nF := \\textstyle\\sum_{n=1}^N \\mathbb{E}_{\\widehat{y} \\sim p(y \\mid f_\\theta(x_n))} \\left[ gg^\\intercal \\right] \\quad\\text{with}\\quad g = \\nabla_\\theta \\log p(\\widehat{y} \\mid f_\\theta(x_n)) \\large\\vert_{\\theta_\\text{MAP}}\n\\]\nNote that if instead of sampling the prediction \\(\\widehat{y} ~ p(y \\mid f(x_n))\\) from the model-defined distribution, we take the actual training-set label \\(y_n\\), the resulting matrix is called the empirical Fisher, which is distinct from the Fisher, yet aligns with it under some conditions, and does not generally capture second-order information. See Kunstner et al. (2019) for an excellent discussion on the distinction.\nInstead of the Fisher, one can use the Generalized Gauss-Newton (GGN):\n\\[\nG := \\textstyle\\sum_{n=1}^N J(x_n) \\left( \\nabla^2_{f} \\log p(y_n \\mid f) \\Large\\vert_{f=f_{\\theta_\\text{map}}(x_n)} \\right) J(x_n)^\\intercal\n\\text{with}\\qquad J(x_n) := \\nabla_\\theta f_\\theta(x_n) \\vert_{\\theta_\\text{map}}\n\\]\nHere \\(J(x_n)\\) represents the Jacobian of the model output w.r.t. the parameters. The middle factor \\(\\nabla^2 …\\) is a Hessian of log-likelihood of \\(y_n\\) w.r.t. model output. Note that the model does not necessarily output ready target probabilities — for instance, classifiers output logits, values that define a probability distribution only after the application of the soft-max.\nUnlike the Fisher, GGN does not require the network to define a probabilistic model on its output (Botev, Ritter, and Barber 2017). 
For models defining an exponential family distribution over the output, the two coincide (Kunstner, Balles, and Hennig 2020). This applies to classifiers since they define a categorical distribution over the output, but not to simple regression models.\nThese matrices are quadratically large, it is infeasible to store them in full. The simplest estimation is to model the matrix as a diagonal — however one can easily contemplate how crude this approximation can be: for 100 parameters, only 1% of the full Hessian is captured.\nA more sophisticated approach, due to Martens and Grosse (2015), is inspired by the observation that in practice the covariance matrices (i.e. inverted Hessians) for neural networks are block-diagonal-dominant. Thus we can effectively model the covariance matrix (and hence the Fisher) as a block-diagonal matrix, where blocks correspond to parameters grouped by layers. Additionally, each block is decomposed into two Kronecker factors, reducing the size of data stored several magnitudes more, at a cost of another assumption.\nLastly, a novel approach is to sketch a low-rank approximation of the Fisher (Sharma, Azizan, and Pavone 2021). Figure 2 shows four Hessian approximation structures:\n\n\n\n\n\n\nFigure 2: (a) Hessian in full, intractable for large networks. (b) Low-rank. (c) Kronecker-factored Approximate Curvature, a block-diagonal method. (d) Diagonal. Source: Daxberger et al. (2021)\n\n\n\nIt is also possible to cut the costs by treating only a subset of the model parameters, i.e. a subnetwork, probabilistically, fixing the remaining parameters at their MAP-estimated values. One special case of subnetwork Laplace that was found to perform well in practice is last-layer Laplace, where the selected subnetwork contains only the weights and biases of the last layer."
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#we-need-you",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#we-need-you",
+ "title": "A new tool for explainable AI",
+ "section": "We need you! 🫵",
+ "text": "We need you! 🫵\nThe ambition for CounterfactualExplanations.jl is to provide a go-to place for counterfactual explanations to the Julia community and beyond. This is a grand ambition, especially for a package that has so far been built by a single developer who has little prior experience with Julia. We would therefore very much like to invite community contributions. If you have an interest in trustworthy AI, the open-source community and Julia, please do get involved! This package is still in its early stages of development, so any kind of contribution is welcome: advice on the core package architecture, pull requests, issues, discussions and even just comments below would be much appreciated.\nTo give you a flavor of what type of future developments we envision, here is a non-exhaustive list:\n\nNative support for additional counterfactual generators and predictive models including those built and trained in Python or R.\nAdditional datasets for testing, evaluation and benchmarking.\nImproved preprocessing including native support for categorical features.\nSupport for regression models.\n\nFinally, if you like this project but don’t have much time, then simply sharing this article or starring the repo on GitHub would also go a long way."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#our-contributions-to-laplaceredux.jl",
- "href": "blog/posts/guest-students-laplace/index.html#our-contributions-to-laplaceredux.jl",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Our contributions to LaplaceRedux.jl",
- "text": "Our contributions to LaplaceRedux.jl\nIn the scope of the project we have added support for: - multi-class classification, in addition to regression and binary classification; - GGN, in addition to empirical Fisher; - hardware-parallelized batched computation of both the empirical Fisher and the GGN; - subnetwork and last-layer Laplace; - KFAC for multi-class classification with Fisher; and - interfacing with MLJ, a common machine learning framework for Julia.\nWe have also made quality assurance / quality-of-life additions to the repository, adding: - a formatting check in the CI/CD pipeline; - an extensive test suite comparing the results of LaplaceRedux.jl against those of its Python counter-part package laplace-torch; and - a benchmark pipeline tracking possible downturns in performance."
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#further-reading",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#further-reading",
+ "title": "A new tool for explainable AI",
+ "section": "Further reading 📚",
+ "text": "Further reading 📚\nIf you’re interested in learning more about this development, feel free to check out the following resources:\n\nPackage docs: [stable], [dev].\nContributor’s guide.\nGitHub repo."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#methodology",
- "href": "blog/posts/guest-students-laplace/index.html#methodology",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Methodology",
- "text": "Methodology\nWe adhered to the Agile/Scrum practices, with two-week-long sprints, and weekly meetings with our formal client, Patrick Altmeyer. We have prioritized the expected requirements by the Moscow method into must-, could-, should-, and won’t-haves. This is all fairly standard for BSc software projects at TU Delft. By the end of the project, we have completed all of our self-assigned must-haves and should-haves."
+ "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#footnotes",
+ "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#footnotes",
+ "title": "A new tool for explainable AI",
+ "section": "Footnotes",
+ "text": "Footnotes\n\n\nSee: [TDS], [blog]↩︎\nFor more information on Bayesian deep learning see my previous post: [TDS], [blog].↩︎\nThe corresponding example involving PyTorch is analogous and therefore not included here. You may find it here.↩︎"
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#pain-points",
- "href": "blog/posts/guest-students-laplace/index.html#pain-points",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Pain Points",
- "text": "Pain Points\nHere we list some obstacles we have encountered along the way: - Julia is slow to compile and load dependencies on less powerful machines. - Stack traces are sometimes rather obscure, though it seems to be the price to pay for macros. - Zygote.jl, the automatic differentiation library, is not self-autodifferentiable – it cannot differentiate its own functions. We would want this since we apply Zygote.jacobians when making predictions with the LA. - There is no accessible tool reporting branch coverage on tests – only line coverage is available. - Limited LSP and Unicode support for Jupyter Lab. - Conversion between Flux and ONNX is not yet implemented. - There is no extension library for Zygote equivalent to BackPACK or ASDL for second-order information.\n\nZygote.jl, the automatic differentiation library, is not self-autodifferentiable: issue. We would want this since we apply Zygote.jacobians when making predictions with the LA.\nThere is no accessible tool reporting branch coverage on tests – only line coverage is available.\nLimited LSP and Unicode support for Jupyter Lab.\nNo conversion between Flux and ONNX is implemented yet ONNX.jl\nThere is no extension library for Zygote equivalent to BackPACK or ASDL for second-order information."
+ "objectID": "blog/posts/conformal-image-classifier/index.html",
+ "href": "blog/posts/conformal-image-classifier/index.html",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "",
+ "text": "Conformalized prediction sets for a simple Deep Image Classifier.\nDeep Learning is popular and, for some tasks like image classification, remarkably powerful. But it is also well-known that Deep Neural Networks (DNN) can be unstable (Goodfellow, Shlens, and Szegedy 2014) and poorly calibrated. Conformal Prediction can be used to mitigate these pitfalls.\nIn the first part of this series of posts on Conformal Prediction, we looked at the basic underlying methodology and how CP can be implemented in Julia using ConformalPrediction.jl. This second part of the series is a more goal-oriented how-to guide: it demonstrates how you can conformalize a deep learning image classifier built in Flux.jl in just a few lines of code.\nSince this is meant to be more of a hands-on article, we will avoid diving too deeply into methodological concepts. If you need more colour on this, be sure to check out the first article on this topic and also A. N. Angelopoulos and Bates (2022). For a more formal treatment of Conformal Prediction see also A. Angelopoulos et al. (2022)."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#highlights",
- "href": "blog/posts/guest-students-laplace/index.html#highlights",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Highlights",
- "text": "Highlights\nAnd here is what we found refreshing: - Metaprogramming and first-class support for macros are something completely different for students who are used to Java & Python. - The Julia standard API, and Flux/Zygote, are fairly straightforward to use, and well-thought-out for numerical computing and machine learning."
+ "objectID": "blog/posts/conformal-image-classifier/index.html#the-task-at-hand",
+ "href": "blog/posts/conformal-image-classifier/index.html#the-task-at-hand",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "🎯 The Task at Hand",
+ "text": "🎯 The Task at Hand\nThe task at hand is to predict the labels of handwritten images of digits using the famous MNIST dataset (LeCun 1998). Importing this popular machine learning dataset in Julia is made remarkably easy through MLDatasets.jl:\n\n\nCode\nusing MLDatasets\nN = 1000\nXraw, yraw = MNIST(split=:train)[:]\nXraw = Xraw[:,:,1:N]\nyraw = yraw[1:N]\n\n\nFigure 1 below shows a few random samples from the training data:\n\n\nCode\nusing MLJ\nusing Images\nX = map(x -> convert2image(MNIST, x), eachslice(Xraw, dims=3))\ny = coerce(yraw, Multiclass)\n\nn_samples = 10\nmosaic(rand(X, n_samples)..., ncol=n_samples)\n\n\n\n\n\n\n\nFigure 1: Random samples from the MNIST dataset."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#conclusions",
- "href": "blog/posts/guest-students-laplace/index.html#conclusions",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Conclusions",
- "text": "Conclusions\nWe have covered some elements of the theory behind Laplace Approximations, laid down our additions to the LaplaceRedux.jl package, and brought out some difficulties we, as complete newcomers to Julia, came across. Hope you have enjoyed the tour, and hopefully it has intrigued you enough to look deeper into Bayesian learning and/or Julia since both are developing at a lively pace. You can check out LaplaceRedux on the JuliaTrustworthyAI GitHub page here. Contributions and comments are welcome!"
+ "objectID": "blog/posts/conformal-image-classifier/index.html#building-the-network",
+ "href": "blog/posts/conformal-image-classifier/index.html#building-the-network",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "🚧 Building the Network",
+ "text": "🚧 Building the Network\nTo model the mapping from image inputs to labels, we will rely on a simple Multi-Layer Perceptron (MLP). A great Julia library for Deep Learning is Flux.jl. But wait … doesn’t ConformalPrediction.jl work with models trained in MLJ.jl? That’s right, but fortunately there exists a Flux.jl interface to MLJ.jl, namely MLJFlux.jl. The interface is still in its early stages, but already very powerful and easily accessible for anyone (like myself) who is used to building Neural Networks in Flux.jl.\nIn Flux.jl, you could build an MLP for this task as follows,\n\n\nCode\nusing Flux\n\nmlp = Chain(\n Flux.flatten,\n Dense(prod((28,28)), 32, relu),\n Dense(32, 10)\n)\n\n\nwhere (28,28) is just the input dimension (28x28 pixel images). Since we have ten digits, our output dimension is ten.1\nWe can do the exact same thing in MLJFlux.jl as follows,\n\n\nCode\nusing MLJFlux\n\nbuilder = MLJFlux.@builder Chain(\n Flux.flatten,\n Dense(prod(n_in), 32, relu),\n Dense(32, n_out)\n)\n\n\nwhere we rely on the @builder macro to make the transition from Flux.jl to MLJ.jl as seamless as possible. Finally, MLJFlux.jl already comes with a number of helper functions to define plain-vanilla networks. In this case, we will use the ImageClassifier with our custom builder and cross-entropy loss:\n\n\nCode\nImageClassifier = @load ImageClassifier\nclf = ImageClassifier(\n builder=builder,\n epochs=10,\n loss=Flux.crossentropy\n)\n\n\nThe generated instance clf is a model (in the MLJ.jl sense) so from this point on we can rely on standard MLJ.jl workflows. For example, we can wrap our model in data to create a machine and then evaluate it on a holdout set as follows:\n\n\nCode\nmach = machine(clf, X, y)\n\nevaluate!(\n mach,\n resampling=Holdout(rng=123, fraction_train=0.8),\n operation=predict_mode,\n measure=[accuracy]\n)\n\n\nThe accuracy of our very simple model is not amazing, but good enough for the purpose of this tutorial. 
For each image, our MLP returns a softmax output for each possible digit: 0,1,2,3,…,9. Since each individual softmax output is valued between zero and one, \\(y_k\\in(0,1)\\), this is commonly interpreted as a probability: \\(y_k \\coloneqq p(y=k|X)\\). Edge cases – that is values close to either zero or one – indicate high predictive certainty. But this is only a heuristic notion of predictive uncertainty (A. N. Angelopoulos and Bates 2022). Next, we will turn this heuristic notion of uncertainty into a rigorous one using Conformal Prediction."
},
{
- "objectID": "blog/posts/guest-students-laplace/index.html#acknowedgements",
- "href": "blog/posts/guest-students-laplace/index.html#acknowedgements",
- "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
- "section": "Acknowedgements",
- "text": "Acknowedgements\nOur team members are Mark Ardman, Severin Bratus, Adelina Cazacu, Andrei Ionescu, and Ivan Makarov. We would like to thank Patrick Altmeyer for the opportunity to work on this unique project and for the continuous guidance throughout the development process. We are also grateful to Sebastijan Dumančić, our coach, Sven van der Voort, our TA mentor, and Antony Bartlett, our supporting advisor."
+ "objectID": "blog/posts/conformal-image-classifier/index.html#conformalizing-the-network",
+ "href": "blog/posts/conformal-image-classifier/index.html#conformalizing-the-network",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "🔥 Conformalizing the Network",
+ "text": "🔥 Conformalizing the Network\nSince clf is a model, it is also compatible with our package: ConformalPrediction.jl. To conformalize our MLP, we therefore only need to call conformal_model(clf). Since the generated instance conf_model is also just a model, we can still rely on standard MLJ.jl workflows. Below we first wrap it in data and then fit it. Aaaand … we’re done! Let’s look at the results in the next section.\n\n\nCode\nusing ConformalPrediction\nconf_model = conformal_model(clf; method=:simple_inductive, coverage=.95)\nmach = machine(conf_model, X, y)\nfit!(mach)"
},
{
- "objectID": "index.html#contribute",
- "href": "index.html#contribute",
- "title": "",
- "section": "Contribute",
- "text": "Contribute\nWe welcome contributions of any kind. If you want to get involved or use our software for or project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\nWorking on related projects?\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
+ "objectID": "blog/posts/conformal-image-classifier/index.html#results",
+ "href": "blog/posts/conformal-image-classifier/index.html#results",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "📊 Results",
+ "text": "📊 Results\nFigure 2 below presents the results. Figure 2 (a) displays highly certain predictions, now defined in the rigorous sense of Conformal Prediction: in each case, the conformal set (just beneath the image) includes only one label.\nFigure 2 (b) and Figure 2 (c) display increasingly uncertain predictions of set size two and three, respectively. They demonstrate that CP is well equipped to deal with samples characterized by high aleatoric uncertainty: digits four (4), seven (7) and nine (9) share certain similarities. So do digits five (5) and six (6) as well as three (3) and eight (8). These may be hard to distinguish from each other even after seeing many examples (and even for a human). It is therefore unsurprising to see that these digits often end up together in conformal sets.\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(a) Randomly selected prediction sets of size \\(|C|=1\\).\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(b) Randomly selected prediction sets of size \\(|C|=2\\).\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(c) Randomly selected prediction sets of size \\(|C|=3\\).\n\n\n\n\n\n\n\nFigure 2: Conformalized predictions from an image classifier."
},
{
- "objectID": "index.html#contact",
- "href": "index.html#contact",
+ "objectID": "blog/posts/conformal-image-classifier/index.html#evaluation",
+ "href": "blog/posts/conformal-image-classifier/index.html#evaluation",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "🧐 Evaluation",
+ "text": "🧐 Evaluation\nTo evaluate the performance of conformal models, specific performance measures can be used to assess if the model is correctly specified and well-calibrated (A. N. Angelopoulos and Bates 2022). We will look at this in some more detail in another post in the future. For now, just be aware that these measures are already available in ConformalPrediction.jl and we will briefly showcase them here.\nAs for many other things, ConformalPrediction.jl taps into the existing functionality of MLJ.jl for model evaluation. In particular, we will see below how we can use the generic evaluate! method on our machine. To assess the correctness of our conformal predictor, we can compute the empirical coverage rate using the custom performance measure emp_coverage. With respect to model calibration we will look at the model’s conditional coverage. For adaptive, well-calibrated conformal models, conditional coverage is high. One general go-to measure for assessing conditional coverage is size-stratified coverage. The custom measure for this purpose is just called size_stratified_coverage, aliased by ssc.\nThe code below implements the model evaluation using cross-validation. 
The Simple Inductive Classifier that we used above is not adaptive and hence the attained conditional coverage is low compared to the overall empirical coverage, which is close to \\(0.95\\), so in line with the desired coverage rate specified above.\n\n\nCode\n_eval = evaluate!(\n mach,\n resampling=CV(),\n operation=predict,\n measure=[emp_coverage, ssc]\n)\ndisplay(_eval)\nprintln(\"Empirical coverage: $(round(_eval.measurement[1], digits=3))\")\nprintln(\"SSC: $(round(_eval.measurement[2], digits=3))\")\n\n\n\nPerformanceEvaluation object with these fields:\n measure, operation, measurement, per_fold,\n per_observation, fitted_params_per_fold,\n report_per_fold, train_test_rows\nExtract:\n┌──────────────────────────────────────────────┬───────────┬─────────────┬──────\n│ measure │ operation │ measurement │ 1.9 ⋯\n├──────────────────────────────────────────────┼───────────┼─────────────┼──────\n│ ConformalPrediction.emp_coverage │ predict │ 0.954 │ 0.0 ⋯\n│ ConformalPrediction.size_stratified_coverage │ predict │ 0.661 │ 0.3 ⋯\n└──────────────────────────────────────────────┴───────────┴─────────────┴──────\n 2 columns omitted\n\n\n\n\nEmpirical coverage: 0.954\nSSC: 0.661\n\n\nWe can attain higher adaptivity (SSC) when using adaptive prediction sets:\n\n\nCode\nconf_model = conformal_model(clf; method=:adaptive_inductive, coverage=.95)\nmach = machine(conf_model, X, y)\nfit!(mach)\n_eval = evaluate!(\n mach,\n resampling=CV(),\n operation=predict,\n measure=[emp_coverage, ssc]\n)\nresults[:adaptive_inductive] = mach\ndisplay(_eval)\nprintln(\"Empirical coverage: $(round(_eval.measurement[1], digits=3))\")\nprintln(\"SSC: $(round(_eval.measurement[2], digits=3))\")\n\n\n\nPerformanceEvaluation object with these fields:\n measure, operation, measurement, per_fold,\n per_observation, fitted_params_per_fold,\n report_per_fold, train_test_rows\nExtract:\n┌──────────────────────────────────────────────┬───────────┬─────────────┬──────\n│ measure │ operation │ 
measurement │ 1.9 ⋯\n├──────────────────────────────────────────────┼───────────┼─────────────┼──────\n│ ConformalPrediction.emp_coverage │ predict │ 0.995 │ 0.0 ⋯\n│ ConformalPrediction.size_stratified_coverage │ predict │ 0.981 │ 0.0 ⋯\n└──────────────────────────────────────────────┴───────────┴─────────────┴──────\n 2 columns omitted\n\n\n\n\nEmpirical coverage: 0.995\nSSC: 0.981\n\n\nWe can also have a look at the resulting set size for both approaches using a custom Plots.jl recipe (fig-setsize). In line with the above, the spread is wider for the adaptive approach, which reflects that “the procedure is effectively distinguishing between easy and hard inputs” (A. N. Angelopoulos and Bates 2022).\n\n\nCode\nplt_list = []\nfor (_mod, mach) in results\n push!(plt_list, bar(mach.model, mach.fitresult, X; title=String(_mod)))\nend\nplot(plt_list..., size=(800,300))\nplot(plt_list..., size=(800,300),bg_colour=:transparent)\n\n\n\n\n\n\n\n \n \n \n\n\n \n \n \n\n\n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 3: Distribution of set sizes for both approaches."
+ },
+ {
+ "objectID": "blog/posts/conformal-image-classifier/index.html#recap",
+ "href": "blog/posts/conformal-image-classifier/index.html#recap",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "🔁 Recap",
+ "text": "🔁 Recap\nIn this short guide, we have seen how easy it is to conformalize a deep learning image classifier in Julia using ConformalPrediction.jl. Almost any deep neural network trained in Flux.jl is compatible with MLJ.jl and can therefore be conformalized in just a few lines of code. This makes it remarkably easy to move uncertainty heuristics to rigorous predictive uncertainty estimates. We have also seen a sneak peek at the performance evaluation of conformal predictors. Stay tuned for more!"
+ },
+ {
+ "objectID": "blog/posts/conformal-image-classifier/index.html#footnotes",
+ "href": "blog/posts/conformal-image-classifier/index.html#footnotes",
+ "title": "How to Conformalize a Deep Image Classifier",
+ "section": "Footnotes",
+ "text": "Footnotes\n\n\nFor a full tutorial on how to build an MNIST image classifier relying solely on Flux.jl, check out this tutorial.↩︎"
+ },
+ {
+ "objectID": "hero.html",
+ "href": "hero.html",
+ "title": "Make sense of your AI models",
+ "section": "",
+ "text": "Taija takes part in Julia Season of Contributions\n\n\n\n\n\n\n\nCounterfactual Explanations\n\n\n\n\n\n\n\nConformal Prediction\n\n\n\n\n\n\n\nBayesian Deep Learning\n\n\n\n\n\nPrevious\n\n\n\nNext\n\n\n\n\n\n\nMake sense of your AI models\nArtificial Intelligence (AI) has been advancing rapidly in recent years. Consequently, Julia’s AI ecosystem has also been growing fast. Taija is an effort to provide users with tools to make sense of the AI models that they train and deploy. Some highlights include:\n\nModel Explainability (CounterfactualExplanations.jl)\nAlgorithmic Recourse (CounterfactualExplanations.jl, AlgorithmicRecourseDynamics.jl)\nPredictive Uncertainty Quantification (ConformalPrediction.jl, LaplaceRedux.jl)\nEffortless Bayesian Deep Learning (LaplaceRedux.jl)\nHybrid Learning (JointEnergyModels.jl)\n\nTaija is a community effort largely maintained by academics and students at TU Delft. We welcome contributions of any kind."
+ },
+ {
+ "objectID": "content/related.html",
+ "href": "content/related.html",
"title": "",
- "section": "Contact",
- "text": "Contact\nProbably the easiest way is to join the JuliaLang Slack and join our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
+ "section": "",
+ "text": "Our packages are currently tailored towards the following larger package ecosystems for AI and machine learning in Julia:\n\nFluxML\nMLJ\n\nOther external packages and ecosystems related to Trustworthy AI in Julia include:\n\nJulia-XAI\nShapML.jl"
},
{
- "objectID": "content/news/news.html",
- "href": "content/news/news.html",
+ "objectID": "content/related.html#related-software",
+ "href": "content/related.html#related-software",
"title": "",
"section": "",
- "text": "Taija takes part in Julia Season of Contributions\n\n\n\n\n\n\n\nCounterfactual Explanations\n\n\n\n\n\n\n\nConformal Prediction\n\n\n\n\n\n\n\nBayesian Deep Learning\n\n\n\n\n\nPrevoius\n\n\n\nNext"
+ "text": "Our packages are currently tailored towards the following larger package ecosystems for AI and machine learning in Julia:\n\nFluxML\nMLJ\n\nOther external packages and ecosystems related to Trustworthy AI in Julia include:\n\nJulia-XAI\nShapML.jl"
},
{
- "objectID": "content/contribute.html",
- "href": "content/contribute.html",
+ "objectID": "content/about.html",
+ "href": "content/about.html",
"title": "",
"section": "",
- "text": "We welcome contributions of any kind. If you want to get involved or use our software for or project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\n\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
+ "text": "Taija currently covers a range of approaches towards making AI systems more trustworthy:\n\nModel Explainability (CounterfactualExplanations.jl)\nAlgorithmic Recourse (CounterfactualExplanations.jl, AlgorithmicRecourseDynamics.jl)\nPredictive Uncertainty Quantification (ConformalPrediction.jl, LaplaceRedux.jl)\nEffortless Bayesian Deep Learning (LaplaceRedux.jl)\nHybrid Learning (JointEnergyModels.jl)\n\nVarious meta packages can be used to extend the core functionality:\n\nPlotting (TaijaPlotting.jl)\nDatasets for testing and benchmarking (TaijaData.jl)\nParallelization (TaijaParallel.jl)\nInteroperability with other programming languages (TaijaInteroperability.jl)\n\nThe TaijaBase.jl package provides common symbols, types and functions that are used across all or multiple Taija packages.\n\n\n\n\n\n\n%%{\n init: {\n 'theme': 'base',\n 'themeVariables': {\n 'primaryColor': '#BB2528',\n 'primaryTextColor': '#fff',\n 'primaryBorderColor': '#7C0000',\n 'lineColor': '#F8B229',\n 'secondaryColor': '#006100',\n 'tertiaryColor': '#e9edfb',\n 'fontFamily': \"avenir\"\n }\n }\n}%%\n\nflowchart TB\n\n classDef taija fill:#389836,stroke:#333,color:#fff;\n classDef core fill:#CB3C33,stroke:#333,color:#fff;\n classDef base fill:#9558B2,stroke:#333,color:#fff;\n\n %% Base\n base[\"TaijaBase.jl\"]\n\n %% Meta\n interop[\"TaijaInteroperability.jl\"]\n data[\"TaijaData.jl\"]\n parallel[\"TaijaParallel.jl\"]\n plotting[\"TaijaPlotting.jl\"]\n\n %% Core\n ce[\"CounterfactualExplanations.jl\"]\n ar[\"AlgorithmiRecourseDynamics.jl\"]\n cp[\"ConformalPrediction.jl\"]\n lr[\"LaplaceRedux.jl\"]\n jem[\"JointEnergyModels.jl\"]\n\n class base base;\n class interop,data,parallel,plotting taija;\n class ce,cp,lr,jem,ar core;\n\n %% Graph\n subgraph \"Meta Packages\"\n data & plotting & parallel & interop\n end\n\n subgraph \"Core Packages\"\n ce & cp & lr & jem & ar\n end\n\n\n\nFigure 1: An overview of the Taija ecosystem.\n\n\n\n\n\n\nWhy Taija?\n\nTaija stands for Trustworthy 
Artificial Intelligence in Julia. When thinking about a logo that embodies trustworthiness, we quickly landed on 🐶."
},
{
- "objectID": "content/contribute.html#contribute",
- "href": "content/contribute.html#contribute",
+ "objectID": "content/about.html#about",
+ "href": "content/about.html#about",
"title": "",
"section": "",
- "text": "We welcome contributions of any kind. If you want to get involved or use our software for or project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\n\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
+ "text": "Taija currently covers a range of approaches towards making AI systems more trustworthy:\n\nModel Explainability (CounterfactualExplanations.jl)\nAlgorithmic Recourse (CounterfactualExplanations.jl, AlgorithmicRecourseDynamics.jl)\nPredictive Uncertainty Quantification (ConformalPrediction.jl, LaplaceRedux.jl)\nEffortless Bayesian Deep Learning (LaplaceRedux.jl)\nHybrid Learning (JointEnergyModels.jl)\n\nVarious meta packages can be used to extend the core functionality:\n\nPlotting (TaijaPlotting.jl)\nDatasets for testing and benchmarking (TaijaData.jl)\nParallelization (TaijaParallel.jl)\nInteroperability with other programming languages (TaijaInteroperability.jl)\n\nThe TaijaBase.jl package provides common symbols, types and functions that are used across all or multiple Taija packages.\n\n\n\n\n\n\n%%{\n init: {\n 'theme': 'base',\n 'themeVariables': {\n 'primaryColor': '#BB2528',\n 'primaryTextColor': '#fff',\n 'primaryBorderColor': '#7C0000',\n 'lineColor': '#F8B229',\n 'secondaryColor': '#006100',\n 'tertiaryColor': '#e9edfb',\n 'fontFamily': \"avenir\"\n }\n }\n}%%\n\nflowchart TB\n\n classDef taija fill:#389836,stroke:#333,color:#fff;\n classDef core fill:#CB3C33,stroke:#333,color:#fff;\n classDef base fill:#9558B2,stroke:#333,color:#fff;\n\n %% Base\n base[\"TaijaBase.jl\"]\n\n %% Meta\n interop[\"TaijaInteroperability.jl\"]\n data[\"TaijaData.jl\"]\n parallel[\"TaijaParallel.jl\"]\n plotting[\"TaijaPlotting.jl\"]\n\n %% Core\n ce[\"CounterfactualExplanations.jl\"]\n ar[\"AlgorithmiRecourseDynamics.jl\"]\n cp[\"ConformalPrediction.jl\"]\n lr[\"LaplaceRedux.jl\"]\n jem[\"JointEnergyModels.jl\"]\n\n class base base;\n class interop,data,parallel,plotting taija;\n class ce,cp,lr,jem,ar core;\n\n %% Graph\n subgraph \"Meta Packages\"\n data & plotting & parallel & interop\n end\n\n subgraph \"Core Packages\"\n ce & cp & lr & jem & ar\n end\n\n\n\nFigure 1: An overview of the Taija ecosystem.\n\n\n\n\n\n\nWhy Taija?\n\nTaija stands for Trustworthy 
Artificial Intelligence in Julia. When thinking about a logo that embodies trustworthiness, we quickly landed on 🐶."
},
{
- "objectID": "content/contact.html",
- "href": "content/contact.html",
+ "objectID": "content/sponsors.html",
+ "href": "content/sponsors.html",
"title": "",
"section": "",
- "text": "Probably the easiest way is to join the JuliaLang Slack and join our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
+ "text": "Some of Taija’s contributors have been partially or fully funded by one or more of the following entities:"
},
{
- "objectID": "content/contact.html#contact",
- "href": "content/contact.html#contact",
+ "objectID": "content/sponsors.html#sponsors",
+ "href": "content/sponsors.html#sponsors",
"title": "",
"section": "",
- "text": "Probably the easiest way is to join the JuliaLang Slack and join our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
+ "text": "Some of Taija’s contributors have been partially or fully funded by one or more of the following entities:"
},
{
"objectID": "content/research.html",
@@ -224,248 +280,213 @@
"text": "Footnotes\n\n\nExperiments were run in parallel using Python’s MAPIE and ConformalPrediction.jl, in order to cross-check results. Reported results were produced using MAPIE.↩︎"
},
{
- "objectID": "content/sponsors.html",
- "href": "content/sponsors.html",
- "title": "",
- "section": "",
- "text": "Some of Taija’s contributors have been partially or fully funded by one or more of the following entities:"
- },
- {
- "objectID": "content/sponsors.html#sponsors",
- "href": "content/sponsors.html#sponsors",
+ "objectID": "content/contact.html",
+ "href": "content/contact.html",
"title": "",
"section": "",
- "text": "Some of Taija’s contributors have been partially or fully funded by one or more of the following entities:"
+ "text": "Probably the easiest way is to join the JuliaLang Slack and join our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
},
{
- "objectID": "content/about.html",
- "href": "content/about.html",
+ "objectID": "content/contact.html#contact",
+ "href": "content/contact.html#contact",
"title": "",
"section": "",
- "text": "Taija currently covers a range of approaches towards making AI systems more trustworthy:\n\nModel Explainability (CounterfactualExplanations.jl)\nAlgorithmic Recourse (CounterfactualExplanations.jl, AlgorithmicRecourseDynamics.jl)\nPredictive Uncertainty Quantification (ConformalPrediction.jl, LaplaceRedux.jl)\nEffortless Bayesian Deep Learning (LaplaceRedux.jl)\nHybrid Learning (JointEnergyModels.jl)\n\nVarious meta packages can be used to extend the core functionality:\n\nPlotting (TaijaPlotting.jl)\nDatasets for testing and benchmarking (TaijaData.jl)\nParallelization (TaijaParallel.jl)\nInteroperability with other programming languages (TaijaInteroperability.jl)\n\nThe TaijaBase.jl package provides common symbols, types and functions that are used across all or multiple Taija packages.\n\n\n\n\n\n\n%%{\n init: {\n 'theme': 'base',\n 'themeVariables': {\n 'primaryColor': '#BB2528',\n 'primaryTextColor': '#fff',\n 'primaryBorderColor': '#7C0000',\n 'lineColor': '#F8B229',\n 'secondaryColor': '#006100',\n 'tertiaryColor': '#e9edfb',\n 'fontFamily': \"avenir\"\n }\n }\n}%%\n\nflowchart TB\n\n classDef taija fill:#389836,stroke:#333,color:#fff;\n classDef core fill:#CB3C33,stroke:#333,color:#fff;\n classDef base fill:#9558B2,stroke:#333,color:#fff;\n\n %% Base\n base[\"TaijaBase.jl\"]\n\n %% Meta\n interop[\"TaijaInteroperability.jl\"]\n data[\"TaijaData.jl\"]\n parallel[\"TaijaParallel.jl\"]\n plotting[\"TaijaPlotting.jl\"]\n\n %% Core\n ce[\"CounterfactualExplanations.jl\"]\n ar[\"AlgorithmiRecourseDynamics.jl\"]\n cp[\"ConformalPrediction.jl\"]\n lr[\"LaplaceRedux.jl\"]\n jem[\"JointEnergyModels.jl\"]\n\n class base base;\n class interop,data,parallel,plotting taija;\n class ce,cp,lr,jem,ar core;\n\n %% Graph\n subgraph \"Meta Packages\"\n data & plotting & parallel & interop\n end\n\n subgraph \"Core Packages\"\n ce & cp & lr & jem & ar\n end\n\n\n\nFigure 1: An overview of the Taija ecosystem.\n\n\n\n\n\n\nWhy Taija?\n\nTaija stands for Trustworthy 
Artificial Intelligence in Julia. When thinking about a logo that embodies trustworthiness, we quickly landed on 🐶."
+ "text": "Probably the easiest way is to join the JuliaLang Slack and join our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
},
{
- "objectID": "content/about.html#about",
- "href": "content/about.html#about",
+ "objectID": "content/contribute.html",
+ "href": "content/contribute.html",
"title": "",
"section": "",
- "text": "Taija currently covers a range of approaches towards making AI systems more trustworthy:\n\nModel Explainability (CounterfactualExplanations.jl)\nAlgorithmic Recourse (CounterfactualExplanations.jl, AlgorithmicRecourseDynamics.jl)\nPredictive Uncertainty Quantification (ConformalPrediction.jl, LaplaceRedux.jl)\nEffortless Bayesian Deep Learning (LaplaceRedux.jl)\nHybrid Learning (JointEnergyModels.jl)\n\nVarious meta packages can be used to extend the core functionality:\n\nPlotting (TaijaPlotting.jl)\nDatasets for testing and benchmarking (TaijaData.jl)\nParallelization (TaijaParallel.jl)\nInteroperability with other programming languages (TaijaInteroperability.jl)\n\nThe TaijaBase.jl package provides common symbols, types and functions that are used across all or multiple Taija packages.\n\n\n\n\n\n\n%%{\n init: {\n 'theme': 'base',\n 'themeVariables': {\n 'primaryColor': '#BB2528',\n 'primaryTextColor': '#fff',\n 'primaryBorderColor': '#7C0000',\n 'lineColor': '#F8B229',\n 'secondaryColor': '#006100',\n 'tertiaryColor': '#e9edfb',\n 'fontFamily': \"avenir\"\n }\n }\n}%%\n\nflowchart TB\n\n classDef taija fill:#389836,stroke:#333,color:#fff;\n classDef core fill:#CB3C33,stroke:#333,color:#fff;\n classDef base fill:#9558B2,stroke:#333,color:#fff;\n\n %% Base\n base[\"TaijaBase.jl\"]\n\n %% Meta\n interop[\"TaijaInteroperability.jl\"]\n data[\"TaijaData.jl\"]\n parallel[\"TaijaParallel.jl\"]\n plotting[\"TaijaPlotting.jl\"]\n\n %% Core\n ce[\"CounterfactualExplanations.jl\"]\n ar[\"AlgorithmiRecourseDynamics.jl\"]\n cp[\"ConformalPrediction.jl\"]\n lr[\"LaplaceRedux.jl\"]\n jem[\"JointEnergyModels.jl\"]\n\n class base base;\n class interop,data,parallel,plotting taija;\n class ce,cp,lr,jem,ar core;\n\n %% Graph\n subgraph \"Meta Packages\"\n data & plotting & parallel & interop\n end\n\n subgraph \"Core Packages\"\n ce & cp & lr & jem & ar\n end\n\n\n\nFigure 1: An overview of the Taija ecosystem.\n\n\n\n\n\n\nWhy Taija?\n\nTaija stands for Trustworthy 
Artificial Intelligence in Julia. When thinking about a logo that embodies trustworthiness, we quickly landed on 🐶."
+ "text": "We welcome contributions of any kind. If you want to get involved or use our software for your project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\n\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
},
{
- "objectID": "content/related.html",
- "href": "content/related.html",
+ "objectID": "content/contribute.html#contribute",
+ "href": "content/contribute.html#contribute",
"title": "",
"section": "",
- "text": "Our packages are currently tailored towards the following larger package ecosystems for AI and machine learning in Julia:\n\nFluxML\nMLJ\n\nOther external packages and ecosystems related to Trustworthy AI in Julia include:\n\nJulia-XAI\nShapML.jl"
+ "text": "We welcome contributions of any kind. If you want to get involved or use our software for your project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\n\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
},
{
- "objectID": "content/related.html#related-software",
- "href": "content/related.html#related-software",
+ "objectID": "content/news/news.html",
+ "href": "content/news/news.html",
"title": "",
"section": "",
- "text": "Our packages are currently tailored towards the following larger package ecosystems for AI and machine learning in Julia:\n\nFluxML\nMLJ\n\nOther external packages and ecosystems related to Trustworthy AI in Julia include:\n\nJulia-XAI\nShapML.jl"
- },
- {
- "objectID": "hero.html",
- "href": "hero.html",
- "title": "Taija",
- "section": "",
- "text": "Taija\n\nTrustworthy Artificial Intelligence in Julia\nTaija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTaija takes part in Julia Season of Contributions\n\n\n\n\n\n\n\nCounterfactual Explanations\n\n\n\n\n\n\n\nConformal Prediction\n\n\n\n\n\n\n\nBayesian Deep Learning\n\n\n\n\n\nPrevoius\n\n\n\nNext"
- },
- {
- "objectID": "blog/posts/conformal-image-classifier/index.html",
- "href": "blog/posts/conformal-image-classifier/index.html",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "",
- "text": "Conformalized prediction sets for asimple Deep Image Classifier.\nDeep Learning is popular and — for some tasks like image classification — remarkably powerful. But it is also well-known that Deep Neural Networks (DNN) can be unstable (Goodfellow, Shlens, and Szegedy 2014) and poorly calibrated. Conformal Prediction can be used to mitigate these pitfalls.\nIn the first part of this series of posts on Conformal Prediction, we looked at the basic underlying methodology and how CP can be implemented in Julia using ConformalPrediction.jl. This second part of the series is a more goal-oriented how-to guide: it demonstrates how you can conformalize a deep learning image classifier built in Flux.jl in just a few lines of code.\nSince this is meant to be more of a hands-on article, we will avoid diving too deeply into methodological concepts. If you need more colour on this, be sure to check out the first article on this topic and also A. N. Angelopoulos and Bates (2022). For a more formal treatment of Conformal Prediction see also A. Angelopoulos et al. (2022)."
+ "text": "Taija takes part in Julia Season of Contributions\n\n\n\n\n\n\n\nCounterfactual Explanations\n\n\n\n\n\n\n\nConformal Prediction\n\n\n\n\n\n\n\nBayesian Deep Learning\n\n\n\n\n\nPrevious\n\n\n\nNext"
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#the-task-at-hand",
- "href": "blog/posts/conformal-image-classifier/index.html#the-task-at-hand",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "🎯 The Task at Hand",
- "text": "🎯 The Task at Hand\nThe task at hand is to predict the labels of handwritten images of digits using the famous MNIST dataset (LeCun 1998). Importing this popular machine learning dataset in Julia is made remarkably easy through MLDatasets.jl:\n\n\nCode\nusing MLDatasets\nN = 1000\nXraw, yraw = MNIST(split=:train)[:]\nXraw = Xraw[:,:,1:N]\nyraw = yraw[1:N]\n\n\nFigure 1 below shows a few random samples from the training data:\n\n\nCode\nusing MLJ\nusing Images\nX = map(x -> convert2image(MNIST, x), eachslice(Xraw, dims=3))\ny = coerce(yraw, Multiclass)\n\nn_samples = 10\nmosaic(rand(X, n_samples)..., ncol=n_samples)\n\n\n\n\n\n\n\nFigure 1: Random samples from the MNIST dataset."
+ "objectID": "index.html#trustworthy-artificial-intelligence-in-julia",
+ "href": "index.html#trustworthy-artificial-intelligence-in-julia",
+ "title": "",
+ "section": "Trustworthy Artificial Intelligence in Julia",
+ "text": "Trustworthy Artificial Intelligence in Julia\nTaija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia."
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#building-the-network",
- "href": "blog/posts/conformal-image-classifier/index.html#building-the-network",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "🚧 Building the Network",
- "text": "🚧 Building the Network\nTo model the mapping from image inputs to labels will rely on a simple Multi-Layer Perceptron (MLP). A great Julia library for Deep Learning is Flux.jl. But wait … doesn’t ConformalPrediction.jl work with models trained in MLJ.jl? That’s right, but fortunately there exists a Flux.jl interface to MLJ.jl, namely MLJFlux.jl. The interface is still in its early stages, but already very powerful and easily accessible for anyone (like myself) who is used to building Neural Networks in Flux.jl.\nIn Flux.jl, you could build an MLP for this task as follows,\n\n\nCode\nusing Flux\n\nmlp = Chain(\n Flux.flatten,\n Dense(prod((28,28)), 32, relu),\n Dense(32, 10)\n)\n\n\nwhere (28,28) is just the input dimension (28x28 pixel images). Since we have ten digits, our output dimension is ten.1\nWe can do the exact same thing in MLJFlux.jl as follows,\n\n\nCode\nusing MLJFlux\n\nbuilder = MLJFlux.@builder Chain(\n Flux.flatten,\n Dense(prod(n_in), 32, relu),\n Dense(32, n_out)\n)\n\n\nwhere here we rely on the @builder macro to make the transition from Flux.jl to MLJ.jl as seamless as possible. Finally, MLJFlux.jl already comes with a number of helper functions to define plain-vanilla networks. In this case, we will use the ImageClassifier with our custom builder and cross-entropy loss:\n\n\nCode\nImageClassifier = @load ImageClassifier\nclf = ImageClassifier(\n builder=builder,\n epochs=10,\n loss=Flux.crossentropy\n)\n\n\nThe generated instance clf is a model (in the MLJ.jl sense) so from this point on we can rely on standard MLJ.jl workflows. For example, we can wrap our model in data to create a machine and then evaluate it on a holdout set as follows:\n\n\nCode\nmach = machine(clf, X, y)\n\nevaluate!(\n mach,\n resampling=Holdout(rng=123, fraction_train=0.8),\n operation=predict_mode,\n measure=[accuracy]\n)\n\n\nThe accuracy of our very simple model is not amazing, but good enough for the purpose of this tutorial. 
For each image, our MLP returns a softmax output for each possible digit: 0,1,2,3,…,9. Since each individual softmax output is valued between zero and one, \\(y_k\\in(0,1)\\), this is commonly interpreted as a probability: \\(y_k \\coloneqq p(y=k|X)\\). Edge cases – that is values close to either zero or one – indicate high predictive certainty. But this is only a heuristic notion of predictive uncertainty (A. N. Angelopoulos and Bates 2022). Next, we will turn this heuristic notion of uncertainty into a rigorous one using Conformal Prediction."
+ "objectID": "index.html#contribute",
+ "href": "index.html#contribute",
+ "title": "",
+ "section": "Contribute",
+ "text": "Contribute\nWe welcome contributions of any kind. If you want to get involved or use our software for your project, please feel free to reach out. If you have questions, comments or issues related to specific packages, please feel free to open issues or discussions on the respective repository.\n\nWorking on related projects?\nAre you working on a Julia package that would fit well into this organization? Or do you perhaps have ideas for future projects? We’d love to hear about it, so please do get in touch!"
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#conformalizing-the-network",
- "href": "blog/posts/conformal-image-classifier/index.html#conformalizing-the-network",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "🔥 Conformalizing the Network",
- "text": "🔥 Conformalizing the Network\nSince clf is a model, it is also compatible with our package: ConformalPrediction.jl. To conformalize our MLP, we therefore only need to call conformal_model(clf). Since the generated instance conf_model is also just a model, we can still rely on standard MLJ.jl workflows. Below we first wrap it in data and then fit it. Aaaand … we’re done! Let’s look at the results in the next section.\n\n\nCode\nusing ConformalPrediction\nconf_model = conformal_model(clf; method=:simple_inductive, coverage=.95)\nmach = machine(conf_model, X, y)\nfit!(mach)"
+ "objectID": "index.html#contact",
+ "href": "index.html#contact",
+ "title": "",
+ "section": "Contact",
+ "text": "Contact\nThe easiest way is probably to join the JuliaLang Slack and then our #taija channel. You can also post a GitHub Issue on our organization repo. You can find @pat-alt’s socials and contact details on his website: www.patalt.org."
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#results",
- "href": "blog/posts/conformal-image-classifier/index.html#results",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "📊 Results",
- "text": "📊 Results\nFigure 2 below presents the results. Figure 2 (a) displays highly certain predictions, now defined in the rigorous sense of Conformal Prediction: in each case, the conformal set (just beneath the image) includes only one label.\nFigure 2 (b) and Figure 2 (c) display increasingly uncertain predictions of set size two and three, respectively. They demonstrate that CP is well equipped to deal with samples characterized by high aleatoric uncertainty: digits four (4), seven (7) and nine (9) share certain similarities. So do digits five (5) and six (6) as well as three (3) and eight (8). These may be hard to distinguish from each other even after seeing many examples (and even for a human). It is therefore unsurprising to see that these digits often end up together in conformal sets.\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(a) Randomly selected prediction sets of size \\(|C|=1\\).\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(b) Randomly selected prediction sets of size \\(|C|=2\\).\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n(c) Randomly selected prediction sets of size \\(|C|=3\\).\n\n\n\n\n\n\n\nFigure 2: Conformalized predictions from an image classifier."
+ "objectID": "blog/posts/guest-students-laplace/index.html",
+ "href": "blog/posts/guest-students-laplace/index.html",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "",
+ "text": "Guest Blog Post\n\n\n\nThis blog post was originally written by Severin Bratus and colleagues from TU Delft and published on Medium. This version of the post includes only minor edits. If you would like to contribute a guest blog post, please get in touch.\nThis post summarizes a quarter-long second-year BSc coursework project at TU Delft. Our team of five students has made multiple improvements to LaplaceRedux.jl, due to Patrick Altmeyer. Inspired by its Pythonic counterpart, laplace-torch, this Julia library aims to provide low-overhead Bayesian uncertainty calibration to deep neural networks via Laplace Approximations (Daxberger et al. 2021).\nWe will begin by demystifying the technical terms in the last sentence, in order to explain our contributions to the library and highlight some impressions from the experience. Note that our team began working on this PhD-tier subject having had only some introductory courses on probability and statistics, machine learning, and computational intelligence, without any prior exposure to Julia."
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#evaluation",
- "href": "blog/posts/conformal-image-classifier/index.html#evaluation",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "🧐 Evaluation",
- "text": "🧐 Evaluation\nTo evaluate the performance of conformal models, specific performance measures can be used to assess if the model is correctly specified and well-calibrated (A. N. Angelopoulos and Bates 2022). We will look at this in some more detail in another post in the future. For now, just be aware that these measures are already available in ConformalPrediction.jl and we will briefly showcase them here.\nAs for many other things, ConformalPrediction.jl taps into the existing functionality of MLJ.jl for model evaluation. In particular, we will see below how we can use the generic evaluate! method on our machine. To assess the correctness of our conformal predictor, we can compute the empirical coverage rate using the custom performance measure emp_coverage. With respect to model calibration we will look at the model’s conditional coverage. For adaptive, well-calibrated conformal models, conditional coverage is high. One general go-to measure for assessing conditional coverage is size-stratified coverage. The custom measure for this purpose is just called size_stratified_coverage, aliased by ssc.\nThe code below implements the model evaluation using cross-validation. 
The Simple Inductive Classifier that we used above is not adaptive and hence the attained conditional coverage is low compared to the overall empirical coverage, which is close to \\(0.95\\), so in line with the desired coverage rate specified above.\n\n\nCode\n_eval = evaluate!(\n mach,\n resampling=CV(),\n operation=predict,\n measure=[emp_coverage, ssc]\n)\ndisplay(_eval)\nprintln(\"Empirical coverage: $(round(_eval.measurement[1], digits=3))\")\nprintln(\"SSC: $(round(_eval.measurement[2], digits=3))\")\n\n\n\nPerformanceEvaluation object with these fields:\n measure, operation, measurement, per_fold,\n per_observation, fitted_params_per_fold,\n report_per_fold, train_test_rows\nExtract:\n┌──────────────────────────────────────────────┬───────────┬─────────────┬──────\n│ measure │ operation │ measurement │ 1.9 ⋯\n├──────────────────────────────────────────────┼───────────┼─────────────┼──────\n│ ConformalPrediction.emp_coverage │ predict │ 0.954 │ 0.0 ⋯\n│ ConformalPrediction.size_stratified_coverage │ predict │ 0.661 │ 0.3 ⋯\n└──────────────────────────────────────────────┴───────────┴─────────────┴──────\n 2 columns omitted\n\n\n\n\nEmpirical coverage: 0.954\nSSC: 0.661\n\n\nWe can attain higher adaptivity (SSC) when using adaptive prediction sets:\n\n\nCode\nconf_model = conformal_model(clf; method=:adaptive_inductive, coverage=.95)\nmach = machine(conf_model, X, y)\nfit!(mach)\n_eval = evaluate!(\n mach,\n resampling=CV(),\n operation=predict,\n measure=[emp_coverage, ssc]\n)\nresults[:adaptive_inductive] = mach\ndisplay(_eval)\nprintln(\"Empirical coverage: $(round(_eval.measurement[1], digits=3))\")\nprintln(\"SSC: $(round(_eval.measurement[2], digits=3))\")\n\n\n\nPerformanceEvaluation object with these fields:\n measure, operation, measurement, per_fold,\n per_observation, fitted_params_per_fold,\n report_per_fold, train_test_rows\nExtract:\n┌──────────────────────────────────────────────┬───────────┬─────────────┬──────\n│ measure │ operation │ 
measurement │ 1.9 ⋯\n├──────────────────────────────────────────────┼───────────┼─────────────┼──────\n│ ConformalPrediction.emp_coverage │ predict │ 0.995 │ 0.0 ⋯\n│ ConformalPrediction.size_stratified_coverage │ predict │ 0.981 │ 0.0 ⋯\n└──────────────────────────────────────────────┴───────────┴─────────────┴──────\n 2 columns omitted\n\n\n\n\nEmpirical coverage: 0.995\nSSC: 0.981\n\n\nWe can also have a look at the resulting set size for both approaches using a custom Plots.jl recipe (fig-setsize). In line with the above, the spread is wider for the adaptive approach, which reflects that “the procedure is effectively distinguishing between easy and hard inputs” (A. N. Angelopoulos and Bates 2022).\n\n\nCode\nplt_list = []\nfor (_mod, mach) in results\n push!(plt_list, bar(mach.model, mach.fitresult, X; title=String(_mod)))\nend\nplot(plt_list..., size=(800,300))\nplot(plt_list..., size=(800,300),bg_colour=:transparent)\n\n\n\n\n\n\n\n \n \n \n\n\n \n \n \n\n\n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 3: Distribution of set sizes for both approaches."
+ "objectID": "blog/posts/guest-students-laplace/index.html#bayesian-learning",
+ "href": "blog/posts/guest-students-laplace/index.html#bayesian-learning",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Bayesian Learning",
+ "text": "Bayesian Learning\nUncertainty calibration remains a crucial issue in safety-critical applications of modern AI, as, for instance, in autonomous driving. You would want your car autopilot not only to make accurate predictions but also to indicate when a model prediction is uncertain, to give control back to the human driver.\nA model is well-calibrated if the confidence of its predictions matches their true accuracy. Note that you can have well-fit models that are badly calibrated, and vice versa (just like in life, you meet people who are smart, yet annoyingly arrogant).\nThe standard deep learning training process of gradient descent converges at a weight configuration that minimizes the loss function. The model obtained may be great, yet it is only a point estimate of what the weight parameters should look like.\nHowever, with the sheer immensity of the weight space, neural networks are probably underspecified by the data (or, overfit). As neural networks can approximate highly complex functions, many weight configurations would yield roughly the same training loss, yet with varying abilities to generalize outside the training dataset. This is why there are so many regularization methods out there, to keep the models simpler. One radical, yet effective approach is described by LeCun, Denker, and Solla (1989):\n\n… it is possible to take a perfectly reasonable network, delete half (or more) of the weights and wind up with a network that works just as well, or better.\n\n\n\n\n\n\n\nFigure 1: The loss landscape. One can imagine gradient descent as a particle, let’s say a ball, or a grain of sand, rolling to the bottom of a pit. For Bayesian Learning, imagine instead a pile of sand poured around that bottom point, with the pile being thicker where the loss is lower. This proverbial sand pile would represent the posterior parameter distribution. Figure due to Amini et al. 
(2019)\n\n\n\nThe way gradient descent is usually illustrated is with a picture like the one shown in Figure 1 above: a curved terrain of the loss function across the parameter space. Each point of the horizontal plane corresponds to some configuration of parameters. Gradient descent seeks the point at the bottom of this terrain, as the point with the lowest loss. However, as the loss landscape is highly non-convex and high-dimensional, there are many directions in which we could move and still maintain a low loss. Thus, instead of a single point, we would like to specify a probability distribution around that optimal point. Bayesian methods, and in particular Laplace Approximations, allow us to do this!\nFirstly, the Bayesian approach to neural network uncertainty calibration is that of modelling the posterior using Bayes’ Theorem:\n\\[\np(\\theta \\mid \\mathcal{D}) = \\tfrac{1}{Z} \\,p(\\mathcal{D} \\mid \\theta) \\, p(\\theta), \\qquad Z:= p(\\mathcal{D}) = \\textstyle\\int p(\\mathcal{D} \\mid \\theta) \\, p(\\theta) \\,d\\theta\n\\]\nHere \\(p(\\mathcal{D} \\mid \\theta)\\) is the likelihood of the data given the parameters \\(\\theta\\). The prior distribution \\(p(\\theta)\\) specifies our beliefs about what the model parameters would be prior to observing the data. Finally, the intractable constant \\(Z\\) is called the evidence: it characterizes the probability of observing \\(\\mathcal{D}\\) as a whole, across all possible parameter settings (see here for details).\nFor models returning a probability distribution (e.g. classifiers), the loss is commonly defined as the negative log-likelihood. Thus if gradient descent minimizes loss, it maximizes the likelihood, producing the maximum likelihood estimate (MLE), which (assuming a uniform prior) also maximizes the posterior. This is why we call this point the maximum a posteriori, or the MAP. 
It makes sense to model this point as the mode of the posterior distribution, which could, for example, be a Gaussian distribution (see also the introductory post on this blog)."
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#recap",
- "href": "blog/posts/conformal-image-classifier/index.html#recap",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "🔁 Recap",
- "text": "🔁 Recap\nIn this short guide, we have seen how easy it is to conformalize a deep learning image classifier in Julia using ConformalPrediction.jl. Almost any deep neural network trained in Flux.jl is compatible with MLJ.jl and can therefore be conformalized in just a few lines of code. This makes it remarkably easy to move uncertainty heuristics to rigorous predictive uncertainty estimates. We have also seen a sneak peek at the performance evaluation of conformal predictors. Stay tuned for more!"
+ "objectID": "blog/posts/guest-students-laplace/index.html#laplace-approximations",
+ "href": "blog/posts/guest-students-laplace/index.html#laplace-approximations",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Laplace Approximations",
+ "text": "Laplace Approximations\nWe do this by a simple-yet-smart trick introduced back in the late 18th century by Pierre-Simon Laplace, the self-proclaimed “greatest French mathematician of his time”. In general, the Laplace Approximation (LA) aims to find a Gaussian approximation to a probability density (in our case, the posterior) defined over a set of continuous variables (in our case, the weights) (Bishop 2006). We can then estimate the loss (negative log-likelihood) as its second-order Taylor expansion:\n\\[\n\\mathcal{L}(\\mathcal{D}; \\theta) \\approx \\mathcal{L}(\\mathcal{D}; \\theta_\\text{MAP}) + \\tfrac{1}{2} (\\theta - \\theta_\\text{MAP})^\\intercal \\left( \\nabla^2 _\\theta \\mathcal{L}(\\mathcal{D}; \\theta) \\vert_{\\theta_\\text{MAP}} \\right)(\\theta - \\theta_\\text{MAP})\n\\]\nNote that the first-order Taylor term vanishes at the MAP since it contains the gradient, and the gradient is zero at the MAP, since the MAP is a maximum, by definition. What remains is the constant (zeroth-order) term, and the second-order term, containing the Hessian, which is a matrix of partial second-order derivatives.\nThen from this approximation, we can derive the long-sought multivariate normal distribution with the MAP as the mean, and the inverted Hessian as the covariance:\n\\[\np(\\theta \\mid \\mathcal{D}) \\approx N(\\theta; \\theta_\\text{MAP}, \\varSigma) \\qquad\\text{with}\\qquad \\varSigma := \\left( \\nabla^2_\\theta \\mathcal{L}(\\mathcal{D};\\theta) \\vert_{\\theta_\\text{MAP}} \\right)^{-1}\n\\]\nThe evidence \\(Z\\) is now also tractably approximated in closed form, allowing us to apply Bayes’ theorem to obtain the posterior distribution \\(p(\\theta \\mid \\mathcal{D})\\). 
We can then express the posterior predictive distribution, to obtain the probability for an output \\(y\\), given a prediction \\(f(x_*)\\) for an input \\(x_*\\).\n\\[\np(y \\mid f(x_*), \\mathcal{D}) = \\int p(y \\mid f_\\theta(x_*)) \\, p(\\theta \\mid \\mathcal{D}) \\,d\\theta\n\\]\nThis is what we are really after, after all: instead of giving one singular point-estimate prediction \\(\\widehat{y} = f(x_*)\\), we make the neural network give a distribution over \\(y\\).\nHowever, since the Hessian, a square matrix, defines the covariance between all model parameters (upon inversion), of which there may be millions or billions, the computation and storage of the Hessian (not to speak of inversion!) become intractable, as its size scales quadratically with the number of parameters involved. Thus to apply Laplace approximations to large models, we must make some simplifications, which brings us to…"
},
{
- "objectID": "blog/posts/conformal-image-classifier/index.html#footnotes",
- "href": "blog/posts/conformal-image-classifier/index.html#footnotes",
- "title": "How to Conformalize a Deep Image Classifier",
- "section": "Footnotes",
- "text": "Footnotes\n\n\nFor a full tutorial on how to build an MNIST image classifier relying solely on Flux.jl, check out this tutorial.↩︎"
+ "objectID": "blog/posts/guest-students-laplace/index.html#hessian-approximations",
+ "href": "blog/posts/guest-students-laplace/index.html#hessian-approximations",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Hessian approximations",
+ "text": "Hessian approximations\nMultiple techniques to approximate the Hessian have arisen from a field adjacent to, yet distinct from, Bayesian learning: that of second-order optimization, where Hessians are used to accelerate gradient descent convergence.\nOne such approximation is the Fisher information matrix, or simply the Fisher:\n\\[\nF := \\textstyle\\sum_{n=1}^N \\mathbb{E}_{\\widehat{y} \\sim p(y \\mid f_\\theta(x_n))} \\left[ gg^\\intercal \\right] \\quad\\text{with}\\quad g = \\nabla_\\theta \\log p(\\widehat{y} \\mid f_\\theta(x_n)) \\large\\vert_{\\theta_\\text{MAP}}\n\\]\nNote that if instead of sampling the prediction \\(\\widehat{y} \\sim p(y \\mid f_\\theta(x_n))\\) from the model-defined distribution, we take the actual training-set label \\(y_n\\), the resulting matrix is called the empirical Fisher, which is distinct from the Fisher, yet aligns with it under some conditions, and does not generally capture second-order information. See Kunstner et al. (2019) for an excellent discussion on the distinction.\nInstead of the Fisher, one can use the Generalized Gauss-Newton (GGN):\n\\[\nG := \\textstyle\\sum_{n=1}^N J(x_n) \\left( \\nabla^2_{f} \\log p(y_n \\mid f) \\Large\\vert_{f=f_{\\theta_\\text{MAP}}(x_n)} \\right) J(x_n)^\\intercal\n\\qquad\\text{with}\\qquad J(x_n) := \\nabla_\\theta f_\\theta(x_n) \\vert_{\\theta_\\text{MAP}}\n\\]\nHere \\(J(x_n)\\) represents the Jacobian of the model output w.r.t. the parameters. The middle factor \\(\\nabla^2 …\\) is the Hessian of the log-likelihood of \\(y_n\\) w.r.t. the model output. Note that the model does not necessarily output ready target probabilities; for instance, classifiers output logits, values that define a probability distribution only after the application of the soft-max.\nUnlike the Fisher, GGN does not require the network to define a probabilistic model on its output (Botev, Ritter, and Barber 2017). 
For models defining an exponential family distribution over the output, the two coincide (Kunstner, Balles, and Hennig 2020). This applies to classifiers since they define a categorical distribution over the output, but not to simple regression models.\nThese matrices are quadratically large, so it is infeasible to store them in full. The simplest estimation is to model the matrix as a diagonal; however, one can easily see how crude this approximation can be: for 100 parameters, only 1% of the full Hessian is captured.\nA more sophisticated approach, due to Martens and Grosse (2015), is inspired by the observation that in practice the covariance matrices (i.e. inverted Hessians) for neural networks are block-diagonal-dominant. Thus we can effectively model the covariance matrix (and hence the Fisher) as a block-diagonal matrix, where blocks correspond to parameters grouped by layers. Additionally, each block is decomposed into two Kronecker factors, reducing the size of the stored data by several more orders of magnitude, at the cost of another assumption.\nLastly, a novel approach is to sketch a low-rank approximation of the Fisher (Sharma, Azizan, and Pavone 2021). Figure 2 shows four Hessian approximation structures:\n\n\n\n\n\n\nFigure 2: (a) Hessian in full, intractable for large networks. (b) Low-rank. (c) Kronecker-factored Approximate Curvature, a block-diagonal method. (d) Diagonal. Source: Daxberger et al. (2021)\n\n\n\nIt is also possible to cut the costs by treating only a subset of the model parameters, i.e. a subnetwork, probabilistically, fixing the remaining parameters at their MAP-estimated values. One special case of subnetwork Laplace that was found to perform well in practice is last-layer Laplace, where the selected subnetwork contains only the weights and biases of the last layer."
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html",
- "title": "A new tool for explainable AI",
- "section": "",
- "text": "Turning a 9 (nine) into a 4 (four).\nCounterfactual explanations, which I introduced in one of my previous posts1, offer a simple and intuitive way to explain black-box models without opening them. Still, as of today there exists only one open-source library that provides a unifying approach to generate and benchmark counterfactual explanations for models built and trained in Python (Pawelczyk et al. 2021). This is great, but of limited use to users of other programming languages 🥲.\nEnter CounterfactualExplanations.jl: a Julia package that can be used to explain machine learning algorithms developed and trained in Julia, Python and R. Counterfactual explanations fall into the broader category of explainable artificial intelligence (XAI).\nExplainable AI typically involves models that are not inherently interpretable but require additional tools to be explainable to humans. Examples of the latter include ensembles, support vector machines and deep neural networks. This is not to be confused with interpretable AI, which involves models that are inherently interpretable and transparent such as general additive models (GAM), decision trees and rule-based models.\nSome would argue that we best avoid explaining black-box models altogether (Rudin 2019) and instead focus solely on interpretable AI. While I agree that initial efforts should always be geared towards interpretable models, stopping there would entail missed opportunities and anyway is probably not very realistic in times of DALL\\(\\cdot\\)E and Co.\nThis post introduces the main functionality of the new Julia package. Following a motivating example using a model trained in Julia, we will see how easy the package can be adapted to work with models trained in Python and R. Since the motivation for this post is also to hopefully attract contributors, the final section outlines some of the exciting developments we have planned."
+ "objectID": "blog/posts/guest-students-laplace/index.html#our-contributions-to-laplaceredux.jl",
+ "href": "blog/posts/guest-students-laplace/index.html#our-contributions-to-laplaceredux.jl",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Our contributions to LaplaceRedux.jl",
+ "text": "Our contributions to LaplaceRedux.jl\nIn the scope of the project we have added support for: - multi-class classification, in addition to regression and binary classification; - GGN, in addition to empirical Fisher; - hardware-parallelized batched computation of both the empirical Fisher and the GGN; - subnetwork and last-layer Laplace; - KFAC for multi-class classification with Fisher; and - interfacing with MLJ, a common machine learning framework for Julia.\nWe have also made quality assurance / quality-of-life additions to the repository, adding: - a formatting check in the CI/CD pipeline; - an extensive test suite comparing the results of LaplaceRedux.jl against those of its Python counterpart package laplace-torch; and - a benchmark pipeline tracking possible downturns in performance."
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#counterfactuals-for-image-data",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#counterfactuals-for-image-data",
- "title": "A new tool for explainable AI",
- "section": "Counterfactuals for image data 🖼",
- "text": "Counterfactuals for image data 🖼\nTo introduce counterfactual explanations I used a simple binary classification problem in my previous post. It involved a linear classifier and a linearly separable, synthetic data set with just two features. This time we are going to step it up a notch: we will generate counterfactual explanations MNIST data. The MNIST dataset contains 60,000 training samples of handwritten digits in the form of 28x28 pixel grey-scale images (LeCun 1998). Each image is associated with a label indicating the digit (0-9) that the image represents.\nThe CounterfactualExplanations.jl package ships with two black-box models that were trained to predict labels for this data: firstly, a simple multi-layer perceptron (MLP) and, secondly, a corresponding deep ensemble. Originally proposed by Lakshminarayanan, Pritzel, and Blundell (2017), deep ensembles are really just ensembles of deep neural networks. They are still among the most popular approaches to Bayesian deep learning.2\n\nBlack-box models\nThe code below loads relevant packages along with the MNIST data and pre-trained models.\n\n\nCode\n# Load package, models and data:\nusing CounterfactualExplanations, Flux\nusing CounterfactualExplanations.Data: mnist_data, mnist_model, mnist_ensemble\ndata, X, ys = mnist_data()\nmodel = mnist_model()\nensemble = mnist_ensemble()\ncounterfactual_data = CounterfactualData(X,ys;domain=(0,1))\n\n\nWhile the package can currently handle a few simple classification models natively, it is designed to be easily extensible through users and contributors. 
Extending the package to deal with custom models typically involves only two simple steps:\n\nSubtyping: the custom model needs to be declared as a subtype of the package-internal type AbstractFittedModel.\nMultiple dispatch: the package-internal functions logits and probs need to be extended through custom methods for the new model type.\n\nThe following code implements these two steps first for the MLP and then for the deep ensemble.\n\n\nCode\nusing CounterfactualExplanations.Models\nimport CounterfactualExplanations.Models: logits, probs\n# MLP:\n# Step 1)\nstruct NeuralNetwork <: Models.AbstractFittedModel\n model::Any\nend\n# Step 2)\nlogits(M::NeuralNetwork, X::AbstractArray) = M.model(X)\nprobs(M::NeuralNetwork, X::AbstractArray)= softmax(logits(M, X))\nM = NeuralNetwork(model)\n\n# Deep ensemble:\nusing Flux: stack\n# Step 1)\nstruct FittedEnsemble <: Models.AbstractFittedModel\n ensemble::AbstractArray\nend\n# Step 2)\nusing Statistics\nlogits(M::FittedEnsemble, X::AbstractArray) = mean(stack([m(X) for m in M.ensemble],3),dims=3)\nprobs(M::FittedEnsemble, X::AbstractArray) = mean(stack([softmax(m(X)) for m in M.ensemble],3),dims=3)\nM_ensemble = FittedEnsemble(ensemble)\n\n\n\n\nCounterfactual generators\nNext, we need to specify the counterfactual generators we want to use. The package currently ships with two default generators that both need gradient access: firstly, the generic generator introduced by Wachter, Mittelstadt, and Russell (2017) and, secondly, a greedy generator introduced by Schut et al. (2021).\nThe greedy generator is designed to be used with models that incorporate uncertainty in their predictions such as the deep ensemble introduced above. It works for probabilistic (Bayesian) models, because they only produce high-confidence predictions in regions of the feature domain that are populated by training samples. 
As long as the model is expressive enough and well-specified, counterfactuals in these regions will always be realistic and unambiguous since by construction they should look very similar to training samples. Other popular approaches to counterfactual explanations like REVISE (Joshi et al. 2019) and CLUE (Antorán et al. 2020) also play with this simple idea.\nThe following code instantiates the two generators for the problem at hand.\n\n\nCode\ngeneric = GenericGenerator(;loss=:logitcrossentropy)\ngreedy = GreedyGenerator(;loss=:logitcrossentropy)\n\n\n\n\nExplanations\nOnce the model and counterfactual generator are specified, running counterfactual search is very easy using the package. For a given factual (x), target class (target) and data set (counterfactual_data), simply running\n\ngenerate_counterfactual(x, target, counterfactual_data, M, generic)\n\nwill generate the results, in this case using the generic generator (generic) for the MLP (M). Since we have specified two different black-box models and two different counterfactual generators, we have four combinations of a model and a generator in total. For each of these combinations I have used the generate_counterfactual function to produce the results in Figure 1.\nIn every case the desired label switch is in fact achieved, but arguably from a human perspective only the counterfactuals for the deep ensemble look like a four. The generic generator produces mild perturbations in regions that seem irrelevant from a human perspective, but nonetheless yields a counterfactual that can pass as a four. The greedy approach clearly targets pixels at the top of the handwritten nine and yields the best result overall. For the non-Bayesian MLP, both the generic and the greedy approach generate counterfactuals that look much like adversarial examples: they perturb pixels in seemingly random regions on the image.\n\n\n\n\n\n\nFigure 1: Counterfactual explanations for MNIST: turning a nine (9) into a four (4)."
+ "objectID": "blog/posts/guest-students-laplace/index.html#methodology",
+ "href": "blog/posts/guest-students-laplace/index.html#methodology",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Methodology",
+ "text": "Methodology\nWe adhered to Agile/Scrum practices, with two-week sprints and weekly meetings with our formal client, Patrick Altmeyer. We prioritized the expected requirements using the MoSCoW method into must-, should-, could-, and won’t-haves. This is all fairly standard for BSc software projects at TU Delft. By the end of the project, we had completed all of our self-assigned must-haves and should-haves."
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#language-interoperability",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#language-interoperability",
- "title": "A new tool for explainable AI",
- "section": "Language interoperability 👥",
- "text": "Language interoperability 👥\nThe Julia language offers unique support for programming language interoperability. For example, calling R or Python is made remarkably easy through RCall.jl and PyCall.jl, respectively. This functionality can be leveraged to use CounterfactualExplanations.jl to generate explanations for models that were developed in other programming languages. At this time there is no native support for foreign programming languages, but the following example involving a torch neural network trained in R demonstrates how versatile the package is.3\n\nExplaining a torch model\nWe will consider a simple MLP trained for a binary classification task. As before we first need to adapt this custom model for use with our package. The code below the two necessary steps - sub-typing and method extension. Logits are returned by the torch model and copied from the R environment into the Julia scope. Probabilities are then computed inside the Julia scope by passing the logits through the sigmoid function.\n\n\nCode\nusing Flux\nusing CounterfactualExplanations, CounterfactualExplanations.Models\nimport CounterfactualExplanations.Models: logits, probs # import functions in order to extend\n\n# Step 1)\nstruct TorchNetwork <: Models.AbstractFittedModel\n nn::Any\nend\n\n# Step 2)\nfunction logits(M::TorchNetwork, X::AbstractArray)\n nn = M.nn\n y = rcopy(R\"as_array($nn(torch_tensor(t($X))))\")\n y = isa(y, AbstractArray) ? y : [y]\n return y'\nend\nfunction probs(M::TorchNetwork, X::AbstractArray)\n return σ.(logits(M, X))\nend\nM = TorchNetwork(R\"model\")\n\n\nCompared to models trained in Julia, we need to do a little more work at this point. Since our counterfactual generators need gradient access, we essentially need to allow our package to communicate with the R torch library. 
While this may sound daunting, it turns out to be quite manageable: all we have to do is respecify the function that computes the gradient with respect to the counterfactual loss function so that it can deal with the TorchNetwork type we defined above. That is all the adjustment needed to use CounterfactualExplanations.jl for our custom R model. Figure 2 shows a counterfactual path for a randomly chosen sample with respect to the MLP trained in R.\n\n\n\n\n\n\nExperimental functionality\n\n\n\nYou may have stumbled across the term respecify above: does it really seem like a good idea to just replace an existing function from our package? Surely not! There are certainly better ways to go about this, which we will consider when adding native support for Python and R models in future package releases. Which brings us to our final section …\n\n\n\n\nCode\nimport CounterfactualExplanations.Generators: ∂ℓ\nusing LinearAlgebra\n\n# Countefactual loss:\nfunction ∂ℓ(\n generator::AbstractGradientBasedGenerator, \n counterfactual_state::CounterfactualState) \n M = counterfactual_state.M\n nn = M.nn\n x′ = counterfactual_state.x′\n t = counterfactual_state.target_encoded\n R\"\"\"\n x <- torch_tensor($x′, requires_grad=TRUE)\n output <- $nn(x)\n loss_fun <- nnf_binary_cross_entropy_with_logits\n obj_loss <- loss_fun(output,$t)\n obj_loss$backward()\n \"\"\"\n grad = rcopy(R\"as_array(x$grad)\")\n return grad\nend\n\n\n\n\n\n\n\n\nFigure 2: Counterfactual path using the generic counterfactual generator for a model trained in R."
+ "objectID": "blog/posts/guest-students-laplace/index.html#pain-points",
+ "href": "blog/posts/guest-students-laplace/index.html#pain-points",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Pain Points",
+ "text": "Pain Points\nHere we list some obstacles we have encountered along the way: - Julia is slow to compile and load dependencies on less powerful machines. - Stack traces are sometimes rather obscure, though it seems to be the price to pay for macros. - Zygote.jl, the automatic differentiation library, is not self-autodifferentiable – it cannot differentiate its own functions. We would want this since we apply Zygote.jacobians when making predictions with the LA. - There is no accessible tool reporting branch coverage on tests – only line coverage is available. - Limited LSP and Unicode support for Jupyter Lab. - Conversion between Flux and ONNX is not yet implemented. - There is no extension library for Zygote equivalent to BackPACK or ASDL for second-order information."
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#we-need-you",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#we-need-you",
- "title": "A new tool for explainable AI",
- "section": "We need you! 🫵",
- "text": "We need you! 🫵\nThe ambition for CounterfactualExplanations.jl is to provide a go-to place for counterfactual explanations to the Julia community and beyond. This is a grand ambition, especially for a package that has so far been built by a single developer who has little prior experience with Julia. We would therefore very much like to invite community contributions. If you have an interest in trustworthy AI, the open-source community and Julia, please do get involved! This package is still in its early stages of development, so any kind of contribution is welcome: advice on the core package architecture, pull requests, issues, discussions and even just comments below would be much appreciated.\nTo give you a flavor of what type of future developments we envision, here is a non-exhaustive list:\n\nNative support for additional counterfactual generators and predictive models including those built and trained in Python or R.\nAdditional datasets for testing, evaluation and benchmarking.\nImproved preprocessing including native support for categorical features.\nSupport for regression models.\n\nFinally, if you like this project but don’t have much time, then simply sharing this article or starring the repo on GitHub would also go a long way."
+ "objectID": "blog/posts/guest-students-laplace/index.html#highlights",
+ "href": "blog/posts/guest-students-laplace/index.html#highlights",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Highlights",
+ "text": "Highlights\nAnd here is what we found refreshing: - Metaprogramming and first-class support for macros are something completely different for students who are used to Java & Python. - The Julia standard API, and Flux/Zygote, are fairly straightforward to use, and well-thought-out for numerical computing and machine learning."
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#further-reading",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#further-reading",
- "title": "A new tool for explainable AI",
- "section": "Further reading 📚",
- "text": "Further reading 📚\nIf you’re interested in learning more about this development, feel free to check out the following resources:\n\nPackage docs: [stable], [dev].\nContributor’s guide.\nGitHub repo."
+ "objectID": "blog/posts/guest-students-laplace/index.html#conclusions",
+ "href": "blog/posts/guest-students-laplace/index.html#conclusions",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Conclusions",
+ "text": "Conclusions\nWe have covered some elements of the theory behind Laplace Approximations, laid down our additions to the LaplaceRedux.jl package, and brought out some difficulties we, as complete newcomers to Julia, came across. We hope you have enjoyed the tour and that it has intrigued you enough to look deeper into Bayesian learning and/or Julia, since both are developing at a lively pace. You can check out LaplaceRedux on the JuliaTrustworthyAI GitHub page here. Contributions and comments are welcome!"
},
{
- "objectID": "blog/posts/a-new-tool-for-explainable-ai/index.html#footnotes",
- "href": "blog/posts/a-new-tool-for-explainable-ai/index.html#footnotes",
- "title": "A new tool for explainable AI",
- "section": "Footnotes",
- "text": "Footnotes\n\n\nSee: [TDS], [blog]↩︎\nFor more information on Bayesian deep learning see my previous post: [TDS], [blog].↩︎\nThe corresponding example involving PyTorch is analogous and therefore not included here. You may find it here.↩︎"
+ "objectID": "blog/posts/guest-students-laplace/index.html#acknowedgements",
+ "href": "blog/posts/guest-students-laplace/index.html#acknowedgements",
+ "title": "Paving the Way Towards Low-Overhead Uncertainty Calibration",
+ "section": "Acknowledgements",
+ "text": "Acknowledgements\nOur team members are Mark Ardman, Severin Bratus, Adelina Cazacu, Andrei Ionescu, and Ivan Makarov. We would like to thank Patrick Altmeyer for the opportunity to work on this unique project and for the continuous guidance throughout the development process. We are also grateful to Sebastijan Dumančić, our coach, Sven van der Voort, our TA mentor, and Antony Bartlett, our supporting advisor."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html",
- "href": "blog/posts/effortsless-bayesian-dl/index.html",
- "title": "Go deep, but also … go Bayesian!",
+ "objectID": "blog/posts/conformal-prediction/index.html",
+ "href": "blog/posts/conformal-prediction/index.html",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
"section": "",
- "text": "A Bayesian Neural Network gradually learns.\nDeep learning has dominated AI research in recent years1 - but how much promise does it really hold? That is very much an ongoing and increasingly polarising debate that you can follow live on Twitter. On one side you have optimists like Ilya Sutskever, chief scientist of OpenAI, who believes that large deep neural networks may already be slightly conscious - that’s “may” and “slightly” and only if you just go deep enough? On the other side you have prominent skeptics like Judea Pearl who has long since argued that deep learning still boils down to curve fitting - purely associational and not even remotely intelligent (Pearl and Mackenzie 2018)."
+ "text": "Prediction sets for two different samples and changing coverage rates. As coverage grows, so does the size of the prediction sets.\nA first crucial step towards building trustworthy AI systems is to be transparent about predictive uncertainty. Model parameters are random variables and their values are estimated from noisy data. That inherent stochasticity feeds through to model predictions and should be addressed, at the very least in order to avoid overconfidence in models.\nBeyond that obvious concern, it turns out that quantifying model uncertainty actually opens up a myriad of possibilities to improve up- and down-stream modeling tasks like active learning and robustness. In Bayesian Active Learning, for example, uncertainty estimates are used to guide the search for new input samples, which can make ground-truthing tasks more efficient (Houlsby et al. 2011). With respect to model performance in downstream tasks, uncertainty quantification can be used to improve model calibration and robustness (Lakshminarayanan, Pritzel, and Blundell 2017).\nIn previous posts we have looked at how uncertainty can be quantified in the Bayesian context (see here and here). Since in Bayesian modeling we are generally concerned with estimating posterior distributions, we get uncertainty estimates almost as a byproduct. This is great for all intents and purposes, but it hinges on assumptions about prior distributions. Personally, I have no quarrel with the idea of making prior distributional assumptions. On the contrary, I think the Bayesian framework formalizes the idea of integrating prior information in models and therefore provides a powerful toolkit for conducting science. Still, in some cases this requirement may be seen as too restrictive or we may simply lack prior information.\nEnter: Conformal Prediction (CP), a scalable frequentist approach to uncertainty quantification and coverage control. In this post we will go through the basic concepts underlying CP. 
A number of hands-on usage examples in Julia should hopefully help to convey some intuition and ideally attract people interested in contributing to a new and exciting open-source development."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html#the-case-for-bayesian-deep-learning",
- "href": "blog/posts/effortsless-bayesian-dl/index.html#the-case-for-bayesian-deep-learning",
- "title": "Go deep, but also … go Bayesian!",
- "section": "The case for Bayesian Deep Learning",
- "text": "The case for Bayesian Deep Learning\nWhatever side of this entertaining twitter dispute you find yourself on, the reality is that deep-learning systems have already been deployed at large scale both in academia and industry. More pressing debates therefore revolve around the trustworthiness of these existing systems. How robust are they and in what way exactly do they arrive at decisions that affect each and every one of us? Robustifying deep neural networks generally involves some form of adversarial training, which is costly, can hurt generalization (Raghunathan et al. 2019) and does ultimately not guarantee stability (Bastounis, Hansen, and Vlačić 2021). With respect to interpretability, surrogate explainers like LIME and SHAP are among the most popular tools, but they too have been shown to lack robustness (Slack et al. 2020).\nExactly why are deep neural networks unstable and in-transparent? Let \\(\\mathcal{D}=\\{x,y\\}_{n=1}^N\\) denote our feature-label pairs and let \\(f(x;\\theta)=y\\) denote some deep neural network specified by its parameters \\(\\theta\\). Then the first thing to note is that the number of free parameters \\(\\theta\\) is typically huge (if you ask Mr Sutskever it really probably cannot be huge enough!). That alone makes it very hard to monitor and interpret the inner workings of deep-learning algorithms. Perhaps more importantly though, the number of parameters relative to the size of \\(\\mathcal{D}\\) is generally huge:\n\n[…] deep neural networks are typically very underspecified by the available data, and […] parameters [therefore] correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)\n\nIn other words, training a single deep neural network may (and usually does) lead to one random parameter specification that fits the underlying data very well. But in all likelihood there are many other specifications that also fit the data very well. 
This is both a strength and vulnerability of deep learning: it is a strength because it typically allows us to find one such “compelling explanation” for the data with ease through stochastic optimization; it is a vulnerability because one has to wonder:\n\nHow compelling is an explanation really if it competes with many other equally compelling, but potentially very different explanations?\n\nA scenario like this very much calls for treating predictions from deep learning models probabilistically (Wilson 2020)23.\nFormally, we are interested in estimating the posterior predictive distribution as the following Bayesian model average (BMA):\n\\[\np(y|x,\\mathcal{D}) = \\int p(y|x,\\theta)p(\\theta|\\mathcal{D})d\\theta\n\\]\nThe integral implies that we essentially need many predictions from many different specifications of \\(\\theta\\). Unfortunately, this means more work for us or rather our computers. Fortunately though, researchers have proposed many ingenious ways to approximate the equation above in recent years: Gal and Ghahramani (2016) propose using dropout at test time while Lakshminarayanan, Pritzel, and Blundell (2017) show that averaging over an ensemble of just five models seems to do the trick. Still, despite their simplicity and usefulness these approaches involve additional computational costs compared to training just a single network. As we shall see now though, another promising approach has recently entered the limelight: Laplace approximation (LA).\nIf you have read my previous post on Bayesian Logistic Regression, then the term Laplace should already sound familiar to you. As a matter of fact, we will see that all concepts covered in that previous post can be naturally extended to deep learning. While some of these concepts will be revisited below, I strongly recommend you check out the previous post before reading on here. Without further ado let us now see how LA can be used for truly effortless deep learning."
+ "objectID": "blog/posts/conformal-prediction/index.html#sec-background",
+ "href": "blog/posts/conformal-prediction/index.html#sec-background",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "section": "📖 Background",
+ "text": "📖 Background\nConformal Prediction promises to be an easy-to-understand, distribution-free and model-agnostic way to generate statistically rigorous uncertainty estimates. That’s quite a mouthful, so let’s break it down: firstly, as I will hopefully manage to illustrate in this post, the underlying concepts truly are fairly straightforward to understand; secondly, CP indeed relies on only minimal distributional assumptions; thirdly, common procedures to generate conformal predictions really do apply almost universally to all supervised models, therefore making the framework very intriguing to the ML community; and, finally, CP does in fact come with a frequentist coverage guarantee that ensures that conformal prediction sets contain the true value with at least a user-chosen probability. For a formal proof of this marginal coverage property and a detailed introduction to the topic, I recommend Angelopoulos and Bates (2022).\n\n\n\n\n\n\nNote\n\n\n\nIn what follows we will loosely treat the tutorial by Angelopoulos and Bates (2022) and the general framework it sets as a reference. You are not expected to have read the paper, but I also won’t reiterate any details here.\n\n\nCP can be used to generate prediction intervals for regression models and prediction sets for classification models (more on this later). There is also some recent work on conformal predictive distributions and probabilistic predictions. Interestingly, it can even be used to complement Bayesian methods. Angelopoulos and Bates (2022), for example, point out that prior information should be incorporated into prediction sets and demonstrate how Bayesian predictive distributions can be conformalized in order to comply with the frequentist notion of coverage. Relatedly, Hoff (2021) proposes a Bayes-optimal prediction procedure. And finally, Stanton, Maddox, and Wilson (2022) very recently proposed a way to introduce conformal prediction in Bayesian Optimization. 
I find this type of work that combines different schools of thought very promising, but I’m drifting off a little … So, without further ado, let us look at some code."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html#laplace-approximation",
- "href": "blog/posts/effortsless-bayesian-dl/index.html#laplace-approximation",
- "title": "Go deep, but also … go Bayesian!",
- "section": "Laplace Approximation",
- "text": "Laplace Approximation\nWhile LA was first proposed in the 18th century, it has so far not attracted serious attention from the deep learning community largely because it involves a possibly large Hessian computation. Daxberger et al. (2021) are on a mission to change the perception that LA has no use in DL: in their NeurIPS 2021 paper they demonstrate empirically that LA can be used to produce Bayesian model averages that are at least at par with existing approaches in terms of uncertainty quantification and out-of-distribution detection and significantly cheaper to compute. They show that recent advancements in autodifferentation can be leveraged to produce fast and accurate approximations of the Hessian and even provide a fully-fledged Python library that can be used with any pretrained Torch model. For this post, I have built a much less comprehensive, pure-play equivalent of their package in Julia - LaplaceRedux.jl can be used with deep learning models built in Flux.jl, which is Julia’s main DL library. As in the previous post on Bayesian logistic regression I will rely on Julia code snippits instead of equations to convey the underlying maths. If you’re curious about the maths, the NeurIPS 2021 paper provides all the detail you need.\n\nFrom Bayesian Logistic Regression …\nLet’s recap: in the case of logistic regression we had a assumed a zero-mean Gaussian prior \\(p(\\mathbf{w}) \\sim \\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\sigma_0^2 \\mathbf{I} \\right)=\\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\mathbf{H}_0^{-1} \\right)\\) for the weights that are used to compute logits \\(\\mu_n=\\mathbf{w}^T\\mathbf{x}_n\\), which in turn are fed to a sigmoid function to produce probabilities \\(p(y_n=1)=\\sigma(\\mu_n)\\). 
We saw that under this assumption solving the logistic regression problem corresponds to minimizing the following differentiable loss function:\n\\[\n\\ell(\\mathbf{w})= - \\sum_{n}^N [y_n \\log \\mu_n + (1-y_n)\\log (1-\\mu_n)] + \\\\ \\frac{1}{2} (\\mathbf{w}-\\mathbf{w}_0)^T\\mathbf{H}_0(\\mathbf{w}-\\mathbf{w}_0)\n\\]\nAs our first step towards Bayesian deep learning, we observe the following: the loss function above corresponds to the objective faced by a single-layer artificial neural network with sigmoid activation and weight decay4. In other words, regularized logistic regression is equivalent to a very simple neural network architecture and hence it is not surprising that underlying concepts can in theory be applied in much the same way.\nSo let’s quickly recap the next core concept: LA relies on the fact that the second-order Taylor expansion of our loss function \\(\\ell\\) evaluated at the maximum a posteriori (MAP) estimate \\(\\mathbf{\\hat{w}}=\\arg\\max_{\\mathbf{w}} p(\\mathbf{w}|\\mathcal{D})\\) amounts to a multi-variate Gaussian distribution. In particular, that Gaussian is centered around the MAP estimate with covariance equal to the inverse Hessian evaluated at the mode \\(\\hat{\\Sigma}=(\\mathbf{H}(\\mathbf{\\hat{w}}))^{-1}\\) (Murphy 2022).\nThat is basically all there is to the story: if we have a good estimate of \\(\\mathbf{H}(\\mathbf{\\hat{w}})\\) we have an analytical expression for an (approximate) posterior over parameters. So let’s go ahead and start by run Bayesian Logistic regression using Flux.jl. We begin by loading some required packages including LaplaceRedux.jl. 
It ships with a helper function toy_data_linear that creates a toy data set composed of linearly separable samples evenly balanced across the two classes.\n\n\nCode\n# Import libraries.\nusing Flux, Plots, Random, PlotThemes, Statistics, LaplaceRedux\ntheme(:wong)\n# Number of points to generate.\nxs, y = toy_data_linear(100)\nX = hcat(xs...); # bring into tabular format\ndata = zip(xs,y);\n\n\nThen we proceed to prepare the single-layer neural network with weight decay. The term \\(\\lambda\\) determines the strength of the \\(\\ell2\\) penalty: we regularize parameters \\(\\theta\\) more heavily for higher values. Equivalently, we can say that from the Bayesian perspective it governs the strength of the prior \\(p(\\theta) \\sim \\mathcal{N} \\left( \\theta | \\mathbf{0}, \\sigma_0^2 \\mathbf{I} \\right)= \\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\lambda_0^{-2} \\mathbf{I} \\right)\\): a higher value of \\(\\lambda\\) indicates a higher conviction about our prior belief that \\(\\theta=\\mathbf{0}\\), which is of course equivalent to regularizing more heavily. The exact choice of \\(\\lambda=0.5\\) for this toy example is somewhat arbitrary (it made for good visualizations below). 
Note that I have used \\(\\theta\\) to denote our neural parameters to distinguish the case from Bayesian logistic regression, but we are in fact still solving the same problem.\n\n\nCode\nnn = Chain(Dense(2,1))\nλ = 0.5\nsqnorm(x) = sum(abs2, x)\nweight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))\nloss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization();\n\n\nBefore we apply Laplace approximation we train our model:\n\n\nCode\nusing Flux.Optimise: update!, ADAM\nopt = ADAM()\nepochs = 50\n\nfor epoch = 1:epochs\n for d in data\n gs = gradient(params(nn)) do\n l = loss(d...)\n end\n update!(opt, params(nn), gs)\n end\nend\n\n\nUp until this point we have just followed the standard recipe for training a regularized artificial neural network in Flux.jl for a simple binary classification task. To compute the Laplace approximation using LaplaceRedux.jl we need just two more lines of code:\n\n\nCode\nla = laplace(nn, λ=λ)\nfit!(la, data);\n\n\nUnder the hood the Hessian is approximated through the empirical Fisher, which can be computed using only the gradients of our loss function \\(\\nabla_{\\theta}\\ell(f(\\mathbf{x}_n;\\theta,y_n))\\) where \\(\\{\\mathbf{x}_n,y_n\\}\\) are training data (see NeurIPS 2021 paper for details). Finally, LaplaceRedux.jl ships with a function predict(𝑳::LaplaceRedux, X::AbstractArray; link_approx=:probit) that computes the posterior predictive using a probit approximation, much like we saw in the previous post. That function is used under the hood of the plot_contour function below to create the right panel of Figure 1. It visualizes the posterior predictive distribution in the 2D feature space. For comparison I have added the corresponding plugin estimate as well. 
Note how for the Laplace approximation the predicted probabilities fan out indicating that confidence decreases in regions scarce of data.\n\n\nCode\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin);\np_laplace = plot_contour(X',y,la;title=\"Laplace\")\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))\nsavefig(plt, \"www/posterior_predictive_logit.png\");\n\n\n\n\n\n\n\n\nFigure 1: Posterior predictive distribution of Logistic regression in the 2D feature space using plugin estimator (left) and Laplace approximation (right).\n\n\n\n\n\n… to Bayesian Neural Networks\nNow let’s step it up a notch: we will repeat the exercise from above, but this time for data that is not linearly separable using a simple MLP instead of the single-layer neural network we used above. The code below is almost the same as above, so I will not go through the various steps again.\n\n\nCode\n# Number of points to generate:\nxs, y = toy_data_non_linear(200)\nX = hcat(xs...); # bring into tabular format\ndata = zip(xs,y)\n\n# Build MLP:\nn_hidden = 32\nD = size(X)[1]\nnn = Chain(\n Dense(D, n_hidden, σ),\n Dense(n_hidden, 1)\n) \nλ = 0.01\nsqnorm(x) = sum(abs2, x)\nweight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))\nloss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization()\n\n# Training:\nepochs = 200\nfor epoch = 1:epochs\n for d in data\n gs = gradient(params(nn)) do\n l = loss(d...)\n end\n update!(opt, params(nn), gs)\n end\nend\n\n\nFitting the Laplace approximation is also analogous, but note that this time we have added an argument: subset_of_weights=:last_layer. This specifies that we only want to use the parameters of the last layer of our MLP. While we could have used all of them (subset_of_weights=:all), Daxberger et al. (2021) find that the last-layer Laplace approximation produces satisfying results, while being computationally cheaper. 
Figure 2 demonstrates that once again the Laplace approximation yields a posterior predictive distribution that is more conservative than the over-confident plugin estimate.\n\n\nCode\nla = laplace(nn, λ=λ, subset_of_weights=:last_layer)\nfit!(la, data);\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin)\np_laplace = plot_contour(X',y,la;title=\"Laplace\")\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))\nsavefig(plt, \"www/posterior_predictive_mlp.png\");\n\n\n\n\n\n\n\n\nFigure 2: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right).\n\n\n\nTo see why this is a desirable outcome consider the zoomed out version of Figure 2 below: the plugin estimator classifies with full confidence in regions completely scarce of any data. Arguably Laplace approximation produces a much more reasonable picture, even though it too could likely be improved by fine-tuning our choice of \\(\\lambda\\) and the neural network architecture.\n\n\nCode\nzoom=-50\np_plugin = plot_contour(X',y,la;title=\"Plugin\",type=:plugin,zoom=zoom);\np_laplace = plot_contour(X',y,la;title=\"Laplace\",zoom=zoom);\n# Plot the posterior distribution with a contour plot.\nplt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400));\nsavefig(plt, \"www/posterior_predictive_mlp_zoom.png\");\n\n\n\n\n\n\n\n\nFigure 3: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right). Zoomed out."
+ "objectID": "blog/posts/conformal-prediction/index.html#sec-julia",
+ "href": "blog/posts/conformal-prediction/index.html#sec-julia",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "section": "📦 Conformal Prediction in Julia",
+ "text": "📦 Conformal Prediction in Julia\nIn this section of this first short post on CP we will look at how conformal prediction can be implemented in Julia. In particular, we will look at an approach that is compatible with any of the many supervised machine learning models available in MLJ: a beautiful, comprehensive machine learning framework funded by the Alan Turing Institute and the New Zealand Strategic Science Investment Fund (Blaom et al. 2020). We will go through some basic usage examples employing a new Julia package that I have been working on: ConformalPrediction.jl.\n\n\n\n\n\n\nConformalPrediction.jl\n\n\n\nConformalPrediction.jl is a package for uncertainty quantification through conformal prediction for machine learning models trained in MLJ. At the time of writing it is still in its early stages of development, but already implements a range of different approaches to CP. Contributions are very much welcome:\n\nDocumentation\nContributor’s Guide\n\n\n\n\nSplit Conformal Classification\nWe consider a simple binary classification problem. Let \\((X_i, Y_i), \\ i=1,...,n\\) denote our feature-label pairs and let \\(\\mu: \\mathcal{X} \\mapsto \\mathcal{Y}\\) denote the mapping from features to labels. For illustration purposes we will use the moons dataset 🌙. 
Using MLJ.jl we first generate the data and split it into a training and test set:\n\n\nCode\nusing MLJ\nusing Random\nRandom.seed!(123)\n\n# Data:\nX, y = make_moons(500; noise=0.15)\ntrain, test = partition(eachindex(y), 0.8, shuffle=true)\n\n\nHere we will use a specific case of CP called split conformal prediction which can then be summarized as follows:1\n\nPartition the training into a proper training set and a separate calibration set: \\(\\mathcal{D}_n=\\mathcal{D}^{\\text{train}} \\cup \\mathcal{D}^{\\text{cali}}\\).\nTrain the machine learning model on the proper training set: \\(\\hat\\mu_{i \\in \\mathcal{D}^{\\text{train}}}(X_i,Y_i)\\).\nCompute nonconformity scores, \\(\\mathcal{S}\\), using the calibration data \\(\\mathcal{D}^{\\text{cali}}\\) and the fitted model \\(\\hat\\mu_{i \\in \\mathcal{D}^{\\text{train}}}\\).\nFor a user-specified desired coverage ratio \\((1-\\alpha)\\) compute the corresponding quantile, \\(\\hat{q}\\), of the empirical distribution of nonconformity scores, \\(\\mathcal{S}\\).\nFor the given quantile and test sample \\(X_{\\text{test}}\\), form the corresponding conformal prediction set:\n\n\\[\nC(X_{\\text{test}})=\\{y:s(X_{\\text{test}},y) \\le \\hat{q}\\}\n\\tag{1}\\]\nThis is the default procedure used for classification and regression in ConformalPrediction.jl.\nYou may want to take a look at the source code for the classification case here. As a first important step, we begin by defining a concrete type SimpleInductiveClassifier that wraps a supervised model from MLJ.jl and reserves additional fields for a few hyperparameters. As a second step, we define the training procedure, which includes the data-splitting and calibration step. Finally, as a third step we implement the procedure in Equation 1 to compute the conformal prediction set.\n\n\n\n\n\n\nDevelopment Status\n\n\n\nThe permalinks above take you to the version of the package that was up-to-date at the time of writing. 
Since the package is in its early stages of development, the code base and API can be expected to change.\n\n\nNow let’s take this to our 🌙 data. To illustrate the package functionality we will demonstrate the envisioned workflow. We first define our atomic machine learning model following standard MLJ.jl conventions. Using ConformalPrediction.jl we then wrap our atomic model in a conformal model using the standard API call conformal_model(model::Supervised; kwargs...). To train and predict from our conformal model we can then rely on the conventional MLJ.jl procedure again. In particular, we wrap our conformal model in data (turning it into a machine) and then fit it on the training set. Finally, we use our machine to predict the label for a new test sample Xtest:\n\n\nCode\n# Model:\nKNNClassifier = @load KNNClassifier pkg=NearestNeighborModels\nmodel = KNNClassifier(;K=50) \n\n# Training:\nusing ConformalPrediction\nconf_model = conformal_model(model; coverage=.9)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\nXtest = selectrows(X, first(test))\nytest = y[first(test)]\npredict(mach, Xtest)[1]\n\n\nimport NearestNeighborModels\n\n\n ✔\n\n\nUnivariateFinite{Multiclass{2}}(0=>0.94)\n\n\nThe final predictions are set-valued. While the softmax output remains unchanged for the SimpleInductiveClassifier, the size of the prediction set depends on the chosen coverage rate, \\((1-\\alpha)\\).\n\n\nWhen specifying a coverage rate very close to one, the prediction set will typically include many (in some cases all) of the possible labels. Below, for example, both classes are included in the prediction set when setting the coverage rate equal to \\((1-\\alpha)\\)=1.0. 
This is intuitive, since high coverage quite literally requires that the true label is covered by the prediction set with high probability.\n\n\n\n\nCode\nconf_model = conformal_model(model; coverage=coverage)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\nXtest = (x1=[1],x2=[0])\npredict(mach, Xtest)[1]\n\n\nUnivariateFinite{Multiclass{2}}(0=>0.5, 1=>0.5)\n\n\n\n\nConversely, for low coverage rates, prediction sets can also be empty. For a choice of \\((1-\\alpha)\\)=0.1, for example, the prediction set for our test sample is empty. This is a bit difficult to think about intuitively and I have not yet come across a satisfactory, intuitive interpretation.2 When the prediction set is empty, the predict call currently returns missing:\n\n\n\n\nCode\nconf_model = conformal_model(model; coverage=coverage)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\n\n# Conformal Prediction:\npredict(mach, Xtest)[1]\n\n\nmissing\n\n\nFigure 1 should provide some more intuition as to what exactly is happening here. It illustrates the effect of the chosen coverage rate on the predicted softmax output and the set size in the two-dimensional feature space. Contours are overlaid with the moon data points (including test data). The two samples highlighted in red, \\(X_1\\) and \\(X_2\\), have been manually added for illustration purposes. Let’s look at these one by one.\nFirstly, note that \\(X_1\\) (red cross) falls into a region of the domain that is characterized by high predictive uncertainty. It sits right at the bottom-right corner of our class-zero moon 🌜 (orange), a region that is almost entirely enveloped by our class-one moon 🌛 (green). For low coverage rates the prediction set for \\(X_1\\) is empty: on the left-hand side this is indicated by the missing contour for the softmax probability; on the right-hand side we can observe that the corresponding set size is indeed zero. 
For high coverage rates the prediction set includes both \\(y=0\\) and \\(y=1\\), indicative of the fact that the conformal classifier is uncertain about the true label.\nWith respect to \\(X_2\\), we observe that while also sitting on the fringe of our class-zero moon, this sample populates a region that is not fully enveloped by data points from the opposite class. In this region, the underlying atomic classifier can be expected to be more certain about its predictions, but still not highly confident. How is this reflected by our corresponding conformal prediction sets?\n\n\nCode\nXtest_2 = (x1=[-0.5],x2=[0.25])\ncov_ = .9\nconf_model = conformal_model(model; coverage=cov_)\nmach = machine(conf_model, X, y)\nfit!(mach, rows=train)\np̂_2 = pdf(predict(mach, Xtest_2)[1], 0)\n\n\n\n\nWell, for low coverage rates (roughly \\(<0.9\\)) the conformal prediction set does not include \\(y=0\\): the set size is zero (right panel). Only for higher coverage rates do we have \\(C(X_2)=\\{0\\}\\): the coverage rate is high enough to include \\(y=0\\), but the corresponding softmax probability is still fairly low. For example, for \\((1-\\alpha)=0.9\\) we have \\(\\hat{p}(y=0|X_2)=0.72.\\)\n\n\nThese two examples illustrate an interesting point: for regions characterised by high predictive uncertainty, conformal prediction sets are typically empty (for low coverage) or large (for high coverage). 
While set-valued predictions may be something to get used to, this notion is overall intuitive.\n\n\nCode\n# Setup\ncoverages = range(0.75,1.0,length=5)\nn = 100\nx1_range = range(extrema(X.x1)...,length=n)\nx2_range = range(extrema(X.x2)...,length=n)\n\nanim = @animate for coverage in coverages\n conf_model = conformal_model(model; coverage=coverage)\n mach = machine(conf_model, X, y)\n fit!(mach, rows=train)\n p1 = contourf_cp(mach, x1_range, x2_range; type=:proba, title=\"Softmax\", axis=nothing)\n scatter!(p1, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)\n scatter!(p1, Xtest.x1, Xtest.x2, ms=6, c=:red, label=\"X₁\", shape=:cross, msw=6)\n scatter!(p1, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label=\"X₂\", shape=:diamond, msw=6)\n p2 = contourf_cp(mach, x1_range, x2_range; type=:set_size, title=\"Set size\", axis=nothing)\n scatter!(p2, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)\n scatter!(p2, Xtest.x1, Xtest.x2, ms=6, c=:red, label=\"X₁\", shape=:cross, msw=6)\n scatter!(p2, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label=\"X₂\", shape=:diamond, msw=6)\n plot(p1, p2, plot_title=\"(1-α)=$(round(coverage,digits=2))\", size=(800,300))\nend\n\ngif(anim, fps=0.5)\n\n\n\n\n\n\n\nFigure 1: The effect of the coverage rate on the conformal prediction set. Softmax probabilities are shown on the left. The size of the prediction set is shown on the right."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html#wrapping-up",
- "href": "blog/posts/effortsless-bayesian-dl/index.html#wrapping-up",
- "title": "Go deep, but also … go Bayesian!",
- "section": "Wrapping up",
- "text": "Wrapping up\nRecent state-of-the-art research on neural information processing suggests that Bayesian deep learning can be effortless: Laplace approximation for deep neural networks appears to work very well and it does so at minimal computational cost (Daxberger et al. 2021). This is great news, because the case for turning Bayesian is strong: society increasingly relies on complex automated decision-making systems that need to be trustworthy. More and more of these systems involve deep learning which in and of itself is not trustworthy. We have seen that typically there exist various viable parameterizations of deep neural networks each with their own distinct and compelling explanation for the data at hand. When faced with many viable options, don’t put all of your eggs in one basket. In other words, go Bayesian!"
+ "objectID": "blog/posts/conformal-prediction/index.html#conclusion",
+ "href": "blog/posts/conformal-prediction/index.html#conclusion",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "section": "🏁 Conclusion",
+ "text": "🏁 Conclusion\nThis has really been a whistle-stop tour of Conformal Prediction: an active area of research that probably deserves much more attention. Hopefully, though, this post has helped to provide some color and, if anything, made you more curious about the topic. Let’s recap the TL;DR from above:\n\nConformal Prediction is an interesting frequentist approach to uncertainty quantification that can even be combined with Bayes (Section 1).\nIt is scalable and model-agnostic and therefore well applicable to machine learning (Section 1).\nConformalPrediction.jl implements CP in pure Julia and can be used with any supervised model available from MLJ.jl (Section 2).\nImplementing CP directly on top of an existing, powerful machine learning toolkit demonstrates the potential usefulness of this framework to the ML community (Section 2).\nStandard conformal classifiers produce set-valued predictions: for ambiguous samples these sets are typically large (for high coverage) or empty (for low coverage) (Section 2.1).\n\nBelow I will leave you with some further resources."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html#resources",
- "href": "blog/posts/effortsless-bayesian-dl/index.html#resources",
- "title": "Go deep, but also … go Bayesian!",
- "section": "Resources",
- "text": "Resources\nTo get started with Bayesian deep learning I have found many useful and free resources online, some of which are listed below:\n\nTuring.jl tutorial on Bayesian deep learning in Julia.\nVarious RStudio AI blog posts including this one and this one.\nTensorFlow blog post on regression with probabilistic layers.\nKevin Murphy’s draft text book, now also available as print."
+ "objectID": "blog/posts/conformal-prediction/index.html#further-resources",
+ "href": "blog/posts/conformal-prediction/index.html#further-resources",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
+ "section": "📚 Further Resources",
+ "text": "📚 Further Resources\nChances are that you have already come across the Awesome Conformal Prediction repo: Manokhin (2022) provides a comprehensive, up-to-date overview of resources related to conformal prediction. Among the listed articles you will also find Angelopoulos and Bates (2022), which inspired much of this post. The repo also points to open-source implementations in other popular programming languages including Python and R."
},
{
- "objectID": "blog/posts/effortsless-bayesian-dl/index.html#footnotes",
- "href": "blog/posts/effortsless-bayesian-dl/index.html#footnotes",
- "title": "Go deep, but also … go Bayesian!",
+ "objectID": "blog/posts/conformal-prediction/index.html#footnotes",
+ "href": "blog/posts/conformal-prediction/index.html#footnotes",
+ "title": "Conformal Prediction in Julia 🟣🔴🟢",
"section": "Footnotes",
- "text": "Footnotes\n\n\nSee for example this article in the MIT Technology Review↩︎\nIn fact, not treating probabilistic deep learning models as such is sheer madness because remember that the underlying parameters \\(\\theta\\) are random variables. Frequentists and Bayesians alike will tell you that relying on a single point estimate of random variables is just nuts!↩︎\nProponents of Causal AI like Judea Pearl would argue that the Bayesian treatment still does not go far enough: in their view model explanations can only be truly compelling if they are causally found.↩︎\nSee this answer on Stack Exchange for a detailed discussion.↩︎"
+ "text": "Footnotes\n\n\nIn other places split conformal prediction is sometimes referred to as inductive conformal prediction.↩︎\nAny thoughts/comments welcome!↩︎"
},
{
- "objectID": "blog/posts/conformal-regression/index.html",
- "href": "blog/posts/conformal-regression/index.html",
- "title": "Prediction Intervals for any Regression Model",
+ "objectID": "blog/posts/conformal-llm/index.html",
+ "href": "blog/posts/conformal-llm/index.html",
+ "title": "Building a Conformal Chatbot in Julia",
"section": "",
- "text": "Conformal Prediction intervals for different coverage rates. As coverage grows, so does the width of the prediction interval.\nThis is the third (and for now final) part of a series of posts that introduce Conformal Prediction in Julia using ConformalPrediction.jl. The first post introduced Conformal Prediction for supervised classification tasks: we learned that conformal classifiers produce set-valued predictions that are guaranteed to include the true label of a new sample with a certain probability. In the second post we applied these ideas to a more hands-on example: we saw how easy it is to use ConformalPrediction.jl to conformalize a Deep Learning image classifier.\nIn this post, we will look at regression models instead, that is supervised learning tasks involving a continuous outcome variable. Regression tasks are as ubiquitous as classification tasks. For example, we might be interested in using a machine learning model to predict house prices or the inflation rate of the Euro or the parameter size of the next large language model. In fact, many readers may be more familiar with regression models than classification, in which case it may also be easier for you to understand Conformal Prediction (CP) in this context."
- },
- {
- "objectID": "blog/posts/conformal-regression/index.html#background",
- "href": "blog/posts/conformal-regression/index.html#background",
- "title": "Prediction Intervals for any Regression Model",
- "section": "📖 Background",
- "text": "📖 Background\nBefore we start, let’s briefly recap what CP is all about. Don’t worry, we’re not about to deep-dive into methodology. But just to give you a high-level description upfront:\n\nConformal prediction (a.k.a. conformal inference) is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.\n— Angelopoulos and Bates (2022) (arXiv)\n\nIntuitively, CP works under the premise of turning heuristic notions of uncertainty into rigorous uncertainty estimates through repeated sampling or the use of dedicated calibration data.\nIn what follows we will explore what CP can do by going through a standard machine learning workflow using MLJ.jl and ConformalPrediction.jl. There will be less focus on how exactly CP works, but references will point you to additional resources.\n\n\n\n\n\n\nInteractive Version\n\n\n\nThis post is also available as a fully interactive Pluto.jl 🎈 notebook hosted on binder: \nIn my own experience, this may take some time to load, certainly long enough to get yourself a hot beverage ☕ or first read on here. But I promise you that the wait is worth it!"
- },
- {
- "objectID": "blog/posts/conformal-regression/index.html#data",
- "href": "blog/posts/conformal-regression/index.html#data",
- "title": "Prediction Intervals for any Regression Model",
- "section": "📈 Data",
- "text": "📈 Data\nMost machine learning workflows start with data. For illustrative purposes we will work with synthetic data. The helper function below can be used to generate some regression data.\n\n\nCode\nfunction get_data(;N=1000, xmax=3.0, noise=0.5, fun::Function=fun(X) = X * sin(X))\n # Inputs:\n d = Distributions.Uniform(-xmax, xmax)\n X = rand(d, N)\n X = MLJBase.table(reshape(X, :, 1))\n\n # Outputs:\n ε = randn(N) .* noise\n y = @.(fun(X.x1)) + ε\n y = vec(y)\n return X, y\nend\n\n\nFigure 1 illustrates our observations (dots) along with the ground-truth mapping from inputs to outputs (line). We have defined that mapping \\(f: \\mathcal{X} \\mapsto \\mathcal{Y}\\) as follows:\n\n\nCode\nf(X) = X * cos(X)\n\n\nFigure 1: Some synthetic regression data. Observations are shown as dots. The ground-truth mapping from inputs to outputs is shown as a dashed line."
+ "text": "Short demo of our conformal chatbot.\nLarge Language Models are all the buzz right now. They are used for a variety of tasks, including text classification, question answering, and text generation. In this tutorial, we will show how to conformalize a transformer language model for text classification. We will use the Banking77 dataset (Casanueva et al. 2020), which consists of 13,083 queries from 77 intents. On the model side, we will use the DistilRoBERTa model, which is a distilled version of RoBERTa (Liu et al. 2019) finetuned on the Banking77 dataset."
},
{
- "objectID": "blog/posts/conformal-regression/index.html#model-training-using-mlj",
- "href": "blog/posts/conformal-regression/index.html#model-training-using-mlj",
- "title": "Prediction Intervals for any Regression Model",
- "section": "🏋️ Model Training using MLJ",
- "text": "🏋️ Model Training using MLJ\nConformalPrediction.jl is interfaced to MLJ.jl (Blaom et al. 2020): a comprehensive Machine Learning Framework for Julia. MLJ.jl provides a large and growing suite of popular machine learning models that can be used for supervised and unsupervised tasks. Conformal Prediction is a model-agnostic approach to uncertainty quantification, so it can be applied to any common supervised machine learning model.\nThe interface to MLJ.jl therefore seems natural: any (supervised) MLJ.jl model can now be conformalized using ConformalPrediction.jl. By leveraging existing MLJ.jl functionality for common tasks like training, prediction and model evaluation, this package is light-weight and scalable. Now let’s see how all of that works …\nTo start with, let’s split our data into a training and test set:\n\n\nCode\ntrain, test = partition(eachindex(y), 0.4, 0.4, shuffle=true)\n\n\nNow let’s define a model for our regression task:\n\n\nCode\nModel = @load KNNRegressor pkg = NearestNeighborModels\nmodel = Model()\n\n\n\n\n\n\n\n\nHave it your way!\n\n\n\nThink this dataset is too simple? Wondering why on earth I’m not using XGBoost for this task? In the interactive version of this post you have full control over the data and the model. Try it out!\n\n\nUsing standard MLJ.jl workflows let us now first train the unconformalized model. 
We first wrap our model in data:\n\n\nCode\nmach_raw = machine(model, X, y)\n\n\nThen we fit the machine to the training data:\n\n\nCode\nMLJBase.fit!(mach_raw, rows=train, verbosity=0)\n\n\nFigure 2 below shows the resulting point predictions for the test data set:\n\n\nFigure 2: Point predictions for our machine learning model.\n\n\n\n\nHow is our model doing? It’s never quite right, of course, since predictions are estimates and therefore uncertain. Let’s see how we can use Conformal Prediction to express that uncertainty."
+ "objectID": "blog/posts/conformal-llm/index.html#huggingface-model",
+ "href": "blog/posts/conformal-llm/index.html#huggingface-model",
+ "title": "Building a Conformal Chatbot in Julia",
+ "section": "🤗 HuggingFace Model",
+ "text": "🤗 HuggingFace Model\nThe model can be loaded from HF straight into our running Julia session using the Transformers.jl package. Below we load the tokenizer tkr and the model mod. The tokenizer is used to convert the text into a sequence of integers, which is then fed into the model. The model outputs a hidden state, which is then fed into a classifier to get the logits for each class. Finally, the logits are then passed through a softmax function to get the corresponding predicted probabilities. Below we run a few queries through the model to see how it performs.\n\n\nCode\n# Load model from HF 🤗:\ntkr = hgf\"mrm8488/distilroberta-finetuned-banking77:tokenizer\"\nmod = hgf\"mrm8488/distilroberta-finetuned-banking77:ForSequenceClassification\"\n\n# Test model:\nquery = [\n \"What is the base of the exchange rates?\",\n \"Why is my card not working?\",\n \"My Apple Pay is not working, what should I do?\",\n]\na = encode(tkr, query)\nb = mod.model(a)\nc = mod.cls(b.hidden_state)\nd = softmax(c.logit)\n[labels[i] for i in Flux.onecold(d)]\n\n\n3-element Vector{String}:\n \"exchange_rate\"\n \"card_not_working\"\n \"apple_pay_or_google_pay\""
},
{
- "objectID": "blog/posts/conformal-regression/index.html#conformalizing-the-model",
- "href": "blog/posts/conformal-regression/index.html#conformalizing-the-model",
- "title": "Prediction Intervals for any Regression Model",
- "section": "🔥 Conformalizing the Model",
- "text": "🔥 Conformalizing the Model\nWe can turn our model into a conformalized model in just one line of code:\n\n\nCode\nconf_model = conformal_model(model)\n\n\nBy default conformal_model creates an Inductive Conformal Regressor (more on this below) when called on a <:Deterministic model. This behaviour can be changed by using the optional method key argument.\nTo train our conformal model we can once again rely on standard MLJ.jl workflows. We first wrap our model in data:\n\n\nCode\nmach = machine(conf_model, X, y)\n\n\nThen we fit the machine to the data:\n\n\nCode\nMLJBase.fit!(mach, rows=train, verbosity=0)\n\n\nNow let us look at the predictions for our test data again. The chart below shows the results for our conformalized model. Predictions from conformal regressors are range-valued: for each new sample the model returns an interval \\((y_{\\text{lb}},y_{\\text{ub}})\\in\\mathcal{Y}\\) that covers the test sample with a user-specified probability \\((1-\\alpha)\\), where \\(\\alpha\\) is the expected error rate. 
This is known as the marginal coverage guarantee and it is proven to hold under the assumption that training and test data are exchangeable.\n\n\nFigure 3: Prediction intervals for our conformalized machine learning model.\n\n\n\n\nIntuitively, a higher coverage rate leads to larger prediction intervals: since a larger interval covers a larger subspace of \\(\\mathcal{Y}\\), it is more likely to cover the true value.\nI don’t expect you to believe me that the marginal coverage property really holds. In fact, I couldn’t believe it myself when I first learned about it. If you like mathematical proofs, you can find one in this tutorial, for example. If you like convincing yourself through empirical observations, read on below …"
+ "objectID": "blog/posts/conformal-llm/index.html#mlj-interface",
+ "href": "blog/posts/conformal-llm/index.html#mlj-interface",
+ "title": "Building a Conformal Chatbot in Julia",
+ "section": "🔁 MLJ Interface",
+ "text": "🔁 MLJ Interface\nSince our package is interfaced to MLJ.jl, we need to define a wrapper model that conforms to the MLJ interface. In order to add the model for general use, we would probably go through MLJFlux.jl, but for this tutorial, we will make our life easy and simply overload the MLJBase.fit and MLJBase.predict methods. Since the model from HF is already pre-trained and we are not interested in further fine-tuning, we will simply return the model object in the MLJBase.fit method. The MLJBase.predict method will then take the model object and the query and return the predicted probabilities. We also need to define the MLJBase.target_scitype and MLJBase.predict_mode methods. The former tells MLJ what the output type of the model is, and the latter can be used to retrieve the label with the highest predicted probability.\n\n\nCode\nstruct IntentClassifier <: MLJBase.Probabilistic\n tkr::TextEncoders.AbstractTransformerTextEncoder\n mod::HuggingFace.HGFRobertaForSequenceClassification\nend\n\nfunction IntentClassifier(;\n tokenizer::TextEncoders.AbstractTransformerTextEncoder, \n model::HuggingFace.HGFRobertaForSequenceClassification,\n)\n IntentClassifier(tokenizer, model)\nend\n\nfunction get_hidden_state(clf::IntentClassifier, query::Union{AbstractString, Vector{<:AbstractString}})\n token = encode(clf.tkr, query)\n hidden_state = clf.mod.model(token).hidden_state\n return hidden_state\nend\n\n# This doesn't actually retrain the model, but it retrieves the classifier object\nfunction MLJBase.fit(clf::IntentClassifier, verbosity, X, y)\n cache=nothing\n report=nothing\n fitresult = (clf = clf.mod.cls, labels = levels(y))\n return fitresult, cache, report\nend\n\nfunction MLJBase.predict(clf::IntentClassifier, fitresult, Xnew)\n output = fitresult.clf(get_hidden_state(clf, Xnew))\n p̂ = UnivariateFinite(fitresult.labels,softmax(output.logit)',pool=missing)\n return p̂\nend\n\nMLJBase.target_scitype(clf::IntentClassifier) = 
AbstractVector{<:Finite}\n\nMLJBase.predict_mode(clf::IntentClassifier, fitresult, Xnew) = mode.(MLJBase.predict(clf, fitresult, Xnew))\n\n\nTo test that everything is working as expected, we fit the model and generate predictions for a subset of the test data:\n\n\nCode\nclf = IntentClassifier(tkr, mod)\ntop_n = 10\nfitresult, _, _ = MLJBase.fit(clf, 1, nothing, y_test[1:top_n])\n@time ŷ = MLJBase.predict(clf, fitresult, queries_test[1:top_n]);\n\n\n 1.923436 seconds (8.61 M allocations: 631.348 MiB, 2.99% gc time, 84.31% compilation time)"
},
{
- "objectID": "blog/posts/conformal-regression/index.html#evaluation",
- "href": "blog/posts/conformal-regression/index.html#evaluation",
- "title": "Prediction Intervals for any Regression Model",
- "section": "🧐 Evaluation",
- "text": "🧐 Evaluation\nTo verify the marginal coverage property empirically we can look at the empirical coverage rate of our conformal predictor (see Section 3 of the tutorial for details). To this end our package provides a custom performance measure emp_coverage that is compatible with MLJ.jl model evaluation workflows. In particular, we will call evaluate! on our conformal model using emp_coverage as our performance metric. The resulting empirical coverage rate should then be close to the desired level of coverage.\n\n\nCode\nmodel_evaluation =\n evaluate!(_mach, operation=MLJBase.predict, measure=emp_coverage, verbosity=0)\nprintln(\"Empirical coverage: $(round(model_evaluation.measurement[1], digits=3))\")\nprintln(\"Coverage per fold: $(round.(model_evaluation.per_fold[1], digits=3))\")\n\n\nEmpirical coverage: 0.909\nCoverage per fold: [0.94, 0.928, 0.892, 0.874, 0.898, 0.922]\n\n\n\n\n\n✅ ✅ ✅ Great! We got an empirical coverage rate that is slightly higher than desired 😁 … but why isn’t it exactly the same?\n\nIn most cases it will be slightly higher than desired, since \\((1-\\alpha)\\) is a lower bound. But note that it can also be slightly lower than desired. That is because the coverage property is “marginal” in the sense that the probability is averaged over the randomness in the data. For most purposes a large enough calibration set size (\\(n>1000\\)) mitigates that randomness enough. 
Depending on your choices above, the calibration set may be quite small (set to 500), which can lead to coverage slack (see Section 3 in the tutorial).\n\n\n\nSo what’s happening under the hood?\nInductive Conformal Prediction (also referred to as Split Conformal Prediction) broadly speaking works as follows:\n\nPartition the training into a proper training set and a separate calibration set\nTrain the machine learning model on the proper training set.\nUsing some heuristic notion of uncertainty (e.g., absolute error in the regression case), compute nonconformity scores using the calibration data and the fitted model.\nFor the given coverage ratio compute the corresponding quantile of the empirical distribution of nonconformity scores.\nFor the given quantile and test sample \\(X_{\\text{test}}\\), form the corresponding conformal prediction set like so: \\(C(X_{\\text{test}})=\\{y:s(X_{\\text{test}},y) \\le \\hat{q}\\}\\)"
+ "objectID": "blog/posts/conformal-llm/index.html#conformal-chatbot",
+ "href": "blog/posts/conformal-llm/index.html#conformal-chatbot",
+ "title": "Building a Conformal Chatbot in Julia",
+ "section": "🤖 Conformal Chatbot",
+ "text": "🤖 Conformal Chatbot\nTo turn the wrapped, pre-trained model into a conformal intent classifier, we can now rely on standard API calls. We first wrap our atomic model where we also specify the desired coverage rate and method. Since even simple forward passes are computationally expensive for our (small) LLM, we rely on Simple Inductive Conformal Classification.\nconf_model = conformal_model(clf; coverage=0.99, method=:simple_inductive, train_ratio=train_ratio)\nmach = machine(conf_model, queries, y)\n@time fit!(mach)\nSerialization.serialize(\"dev/private/simple_inductive.jls\", mach)\nFinally, we use our conformal LLM to build a simple yet powerful chatbot that runs directly in the Julia REPL. Without dwelling on the details too much, the conformal_chatbot works as follows:\n\nPrompt user to explain their intent.\nFeed user input through conformal LLM and present the output to the user.\nIf the conformal prediction set includes more than one label, prompt the user to either refine their input or choose one of the options included in the set.\n\n\n\nCode\nmach = Serialization.deserialize(\"../dev/private/simple_inductive.jls\")\n\nfunction prediction_set(mach, query::String)\n p̂ = MLJBase.predict(mach, query)[1]\n probs = pdf.(p̂, collect(1:77))\n in_set = findall(probs .!= 0)\n labels_in_set = labels[in_set]\n probs_in_set = probs[in_set]\n _order = sortperm(-probs_in_set)\n plt = UnicodePlots.barplot(labels_in_set[_order], probs_in_set[_order], title=\"Possible Intents\")\n return labels_in_set, plt\nend\n\nfunction conformal_chatbot()\n println(\"👋 Hi, I'm Julia, your conformal chatbot. I'm here to help you with your banking query. Ask me anything or type 'exit' to exit ...\\n\")\n completed = false\n queries = \"\"\n while !completed\n query = readline()\n queries = queries * \",\" * query\n labels, plt = prediction_set(mach, queries)\n if length(labels) > 1\n println(\"🤔 Hmmm ... I can think of several options here. 
If any of these applies, simply type the corresponding number (e.g. '1' for the first option). Otherwise, can you refine your question, please?\\n\")\n println(plt)\n else\n println(\"🥳 I think you mean $(labels[1]). Correct?\")\n end\n\n # Exit:\n if query == \"exit\"\n println(\"👋 Bye!\")\n break\n end\n if query ∈ string.(collect(1:77))\n println(\"👍 Great! You've chosen '$(labels[parse(Int64, query)])'. I'm glad I could help you. Have a nice day!\")\n completed = true\n end\n end\nend\n\n\nBelow we show the output for two example queries. The first one is very ambiguous. As expected, the size of the prediction set is therefore large.\n\n\nCode\nambiguous_query = \"transfer mondey?\"\nprediction_set(mach, ambiguous_query)[2]\n\n\n\n Possible Intents \n ┌ ┐ \n beneficiary_not_allowed ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.150517 \n balance_not_updated_after_bank_transfer ┤■■■■■■■■■■■■■■■■■■■■■■ 0.111409 \n transfer_into_account ┤■■■■■■■■■■■■■■■■■■■ 0.0939535 \n transfer_not_received_by_recipient ┤■■■■■■■■■■■■■■■■■■ 0.091163 \n top_up_by_bank_transfer_charge ┤■■■■■■■■■■■■■■■■■■ 0.0893061 \n failed_transfer ┤■■■■■■■■■■■■■■■■■■ 0.0888321 \n transfer_timing ┤■■■■■■■■■■■■■ 0.0641954 \n transfer_fee_charged ┤■■■■■■■ 0.0361131 \n pending_transfer ┤■■■■■ 0.0270795 \n receiving_money ┤■■■■■ 0.0252126 \n └ ┘ \n\n\n\nThe more refined version of the prompt yields a smaller prediction set: less ambiguous prompts result in lower predictive uncertainty.\n\n\nCode\nrefined_query = \"I tried to transfer money to my friend, but it failed.\"\nprediction_set(mach, refined_query)[2]\n\n\n\n Possible Intents \n ┌ ┐ \n failed_transfer ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.59042 \n beneficiary_not_allowed ┤■■■■■■■ 0.139806 \n transfer_not_received_by_recipient ┤■■ 0.0449784 \n balance_not_updated_after_bank_transfer ┤■■ 0.037894 \n └ ┘ \n\n\n\nBelow we include a short demo video that shows the REPL-based chatbot in action."
},
{
- "objectID": "blog/posts/conformal-regression/index.html#recap",
- "href": "blog/posts/conformal-regression/index.html#recap",
- "title": "Prediction Intervals for any Regression Model",
- "section": "🔃 Recap",
- "text": "🔃 Recap\nThis has been a super quick tour of ConformalPrediction.jl. We have seen how the package naturally integrates with MLJ.jl, allowing users to generate rigorous predictive uncertainty estimates for any supervised machine learning model.\n\nAre we done?\nQuite cool, right? Using a single API call we are able to generate rigorous prediction intervals for all kinds of different regression models. Have we just solved predictive uncertainty quantification once and for all? Do we even need to bother with anything else? Conformal Prediction is a very useful tool, but like so many other things, it is not the final answer to all our problems. In fact, let’s see if we can take CP to its limits.\nThe helper function to generate data from above takes an optional argument xmax. By increasing that value, we effectively expand the domain of our input. Let’s do that and see how our conformal model does on this new out-of-domain data.\n\n\n\n\n\n\n\n \n \n \n\n\n\n \n \n \n\n\n\n \n \n 
\n\n\nFigure 4: Prediction intervals for our conformalized machine learning model applied to out-of-domain data.\n\n\n\n\n\nWhooooops 🤕 … looks like we’re in trouble: in Figure 4 the prediction intervals do not cover out-of-domain test samples well. What happened here?\n\nBy expanding the domain of our inputs, we have violated the exchangeability assumption. When that assumption is violated, the marginal coverage property does not hold. But do not despair! There are ways to deal with this."
+ "objectID": "blog/posts/conformal-llm/index.html#wrapping-up",
+ "href": "blog/posts/conformal-llm/index.html#wrapping-up",
+ "title": "Building a Conformal Chatbot in Julia",
+ "section": "🌯 Wrapping Up",
+ "text": "🌯 Wrapping Up\nThis work was done in collaboration with colleagues at ING as part of the ING Analytics 2023 Experiment Week. Our team demonstrated that Conformal Prediction provides a powerful and principled alternative to top-K intent classification. We won the first prize by popular vote.\nThere are a lot of things that can be improved. As far as LLMs are concerned, we have of course used a fairly small model here. In terms of Conformal Prediction, we have relied on simple inductive conformal classification. This is a good starting point, but there are more advanced methods available (and implemented in the package). Another thing we did not take into consideration here is that we have many outcome classes and may in practice be interested in achieving class-conditional coverage. Stay tuned for more!"
},
{
- "objectID": "blog/posts/conformal-regression/index.html#read-on",
- "href": "blog/posts/conformal-regression/index.html#read-on",
- "title": "Prediction Intervals for any Regression Model",
- "section": "📚 Read on",
- "text": "📚 Read on\nIf you are curious to find out more, be sure to read on in the docs. There are also a number of useful resources to learn more about Conformal Prediction, a few of which I have listed below:\n\nA Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification by Angelopoulos and Bates (2022).\nAwesome Conformal Prediction repository by Manokhin (2022)\nMAPIE: a comprehensive Python library for conformal prediction.\nMy previous two blog posts.\n\nEnjoy!"
+ "objectID": "blog/index.html",
+ "href": "blog/index.html",
+ "title": "Posts",
+ "section": "",
+ "text": "Building a Conformal Chatbot in Julia\n\n\nHuggingFace, Transformers, and Conformal Prediction - Part 1\n\n\nFor this year’s edition of the ING Analytics Experiment Week, we put ConformalPrediction.jl to work and built a chatbot that can be used for Conformal Intent Recognition.\n\n\n\n\n\nJul 5, 2023\n\n\nPatrick Altmeyer\n\n\n7 min\n\n\n8/28/24, 5:16:41 PM\n\n\n\n\n\n\n\n\n\n\n\n\nPaving the Way Towards Low-Overhead Uncertainty Calibration\n\n\nAn Accessible Intro to Laplace Approximations in Julia for Bayesian Deep Learning\n\n\nA guest blog post by a team of students from TU Delft, who have contributed multiple improvements to LaplaceRedux.jl.\n\n\n\n\n\nJul 4, 2023\n\n\nPatrick Altmeyer, Severin Bratus, Mark Ardman, Adelina Cazacu, Andrei Ionescu, Ivan Makarov\n\n\n11 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nPrediction Intervals for any Regression Model\n\n\nConformal Prediction in Julia — Part 3\n\n\nThis third post introduces conformal regression by going through a standard machine learning workflow using MLJ.jl and ConformalPrediction.jl.\n\n\n\n\n\nDec 12, 2022\n\n\nPatrick Altmeyer\n\n\n11 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nHow to Conformalize a Deep Image Classifier\n\n\nConformal Prediction in Julia — Part 2\n\n\nA guide demonstrating how to use ConformalPrediction.jl to conformalize a deep image classifier in a few lines of code.\n\n\n\n\n\nDec 5, 2022\n\n\nPatrick Altmeyer\n\n\n9 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nConformal Prediction in Julia 🟣🔴🟢\n\n\nConformal Prediction in Julia — Part 1\n\n\nA (very) gentle introduction to Conformal Prediction in Julia using my new package ConformalPrediction.jl.\n\n\n\n\n\nOct 25, 2022\n\n\nPatrick Altmeyer\n\n\n15 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nA new tool for explainable AI\n\n\nCounterfactual Explanations in Julia — Part I\n\n\nThis post introduces a new Julia package for generating counterfactual explanations. 
The package can be used to explain machine learning algorithms developed and trained in Julia as well as other popular programming languages like Python and R. \n\n\n\n\n\nApr 20, 2022\n\n\nPatrick Altmeyer\n\n\n12 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\n\n\n\n\n\n\nGo deep, but also … go Bayesian!\n\n\nEffortless Bayesian Deep Learning in Julia — Part I\n\n\nAn introduction to effortless Bayesian deep learning through Laplace approximation coded from scratch in Julia.\n\n\n\n\n\nFeb 18, 2022\n\n\nPatrick Altmeyer\n\n\n12 min\n\n\n8/28/24, 4:44:08 PM\n\n\n\n\n\n\nNo matching items"
}
]
\ No newline at end of file
diff --git a/docs/site_libs/bootstrap/bootstrap.min.css b/docs/site_libs/bootstrap/bootstrap.min.css
index 962efc9..e142eb2 100644
--- a/docs/site_libs/bootstrap/bootstrap.min.css
+++ b/docs/site_libs/bootstrap/bootstrap.min.css
@@ -1,4 +1,4 @@
-@import"https://fonts.googleapis.com/css2?family=Barlow:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&display=swap";@import"https://fonts.googleapis.com/css?family=Roboto";.hero-banner{position:relative;background-color:#e9edfb;display:flex;justify-content:center}.hero-banner h1,.hero-banner .h1{color:#4063d8;font-size:3.5rem}.carousel img{width:150px;height:150px;max-width:70%;margin-bottom:110px;background-color:#fff}.carousel .carousel-control-prev-icon,.carousel .carousel-control-next-icon{margin-bottom:110px}@font-face{font-family:JuliaMono-Light;src:url("https://cdn.jsdelivr.net/gh/cormullion/juliamono/webfonts/JuliaMono-Light.woff2")}/*!
+@import"https://fonts.googleapis.com/css2?family=Barlow:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&display=swap";@import"https://fonts.googleapis.com/css?family=Roboto";.welcome h1,.welcome .h1{color:#4063d8;font-size:3.5rem}.welcome h2,.welcome .h2{border-bottom:0cm;margin-top:0%}.hero-banner{position:relative;background-color:#e9edfb;display:flex;justify-content:center;padding-left:30px;padding-right:30px;flex-wrap:wrap}.hero-banner-text{flex:1;min-width:300px;margin:10px}.hero-banner-carousel{flex:1;min-width:300px;margin:10px}.carousel{margin-top:50px}.carousel img{width:300px;height:300px;max-width:70%;margin-bottom:110px;background-color:#fff}.carousel .carousel-control-prev-icon,.carousel .carousel-control-next-icon{margin-bottom:110px}@font-face{font-family:JuliaMono-Light;src:url("https://cdn.jsdelivr.net/gh/cormullion/juliamono/webfonts/JuliaMono-Light.woff2")}/*!
* Bootstrap v5.3.1 (https://getbootstrap.com/)
* Copyright 2011-2023 The Bootstrap Authors
* Licensed under MIT (https://github.com/twbs/bootstrap/blob/main/LICENSE)
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 7e4e70d..5867577 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -1,71 +1,75 @@
- https://www.taija.org/blog/index.html
- 2024-08-28T14:44:08.419Z
+ https://www.taija.org/welcome.html
+ 2024-08-29T10:13:20.361Z
- https://www.taija.org/blog/posts/conformal-llm/index.html
- 2024-08-28T15:16:41.887Z
+ https://www.taija.org/blog/posts/conformal-regression/index.html
+ 2024-08-28T14:44:08.517Z
- https://www.taija.org/blog/posts/conformal-prediction/index.html
- 2024-08-28T14:44:08.513Z
+ https://www.taija.org/blog/posts/effortsless-bayesian-dl/index.html
+ 2024-08-28T14:44:08.536Z
- https://www.taija.org/blog/posts/guest-students-laplace/index.html
- 2024-08-28T14:44:08.555Z
+ https://www.taija.org/blog/posts/a-new-tool-for-explainable-ai/index.html
+ 2024-08-28T14:44:08.424Z
- https://www.taija.org/index.html
- 2024-08-28T14:30:07.790Z
+ https://www.taija.org/blog/posts/conformal-image-classifier/index.html
+ 2024-08-28T14:44:08.491Z
- https://www.taija.org/content/news/news.html
- 2024-08-28T12:51:08.043Z
+ https://www.taija.org/hero.html
+ 2024-08-29T10:29:47.950Z
- https://www.taija.org/content/contribute.html
- 2024-08-28T10:20:28.150Z
+ https://www.taija.org/content/related.html
+ 2024-08-28T10:23:47.260Z
- https://www.taija.org/content/contact.html
- 2024-08-28T14:39:18.356Z
+ https://www.taija.org/content/about.html
+ 2024-08-28T11:28:48.944Z
+
+
+ https://www.taija.org/content/sponsors.html
+ 2024-08-28T10:23:07.830Z
 https://www.taija.org/content/research.html
 2024-08-28T10:21:53.352Z
- https://www.taija.org/content/sponsors.html
- 2024-08-28T10:23:07.830Z
+ https://www.taija.org/content/contact.html
+ 2024-08-28T14:39:18.356Z
- https://www.taija.org/content/about.html
- 2024-08-28T11:28:48.944Z
+ https://www.taija.org/content/contribute.html
+ 2024-08-28T10:20:28.150Z
- https://www.taija.org/content/related.html
- 2024-08-28T10:23:47.260Z
+ https://www.taija.org/content/news/news.html
+ 2024-08-28T12:51:08.043Z
- https://www.taija.org/hero.html
- 2024-08-28T14:29:58.072Z
+ https://www.taija.org/index.html
+ 2024-08-29T10:29:59.372Z
- https://www.taija.org/blog/posts/conformal-image-classifier/index.html
- 2024-08-28T14:44:08.491Z
+ https://www.taija.org/blog/posts/guest-students-laplace/index.html
+ 2024-08-28T14:44:08.555Z
- https://www.taija.org/blog/posts/a-new-tool-for-explainable-ai/index.html
- 2024-08-28T14:44:08.424Z
+ https://www.taija.org/blog/posts/conformal-prediction/index.html
+ 2024-08-28T14:44:08.513Z
- https://www.taija.org/blog/posts/effortsless-bayesian-dl/index.html
- 2024-08-28T14:44:08.536Z
+ https://www.taija.org/blog/posts/conformal-llm/index.html
+ 2024-08-28T15:16:41.887Z
- https://www.taija.org/blog/posts/conformal-regression/index.html
- 2024-08-28T14:44:08.517Z
+ https://www.taija.org/blog/index.html
+ 2024-08-28T14:44:08.419Z
diff --git a/docs/welcome.html b/docs/welcome.html
new file mode 100644
index 0000000..483aed5
--- /dev/null
+++ b/docs/welcome.html
@@ -0,0 +1,602 @@
+
+
+
+
+
+
+
+
+
+welcome
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Welcome to Taija
+
+
Trustworthy Artificial Intelligence in Julia
+
Taija is the organization that hosts software geared towards Trustworthy Artificial Intelligence in Julia.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/profile/hero.qmd b/profile/hero.qmd
index d1e38b1..bf65824 100644
--- a/profile/hero.qmd
+++ b/profile/hero.qmd
@@ -1,21 +1,23 @@
-::: {.hero-banner}
-
-::: {.column-page}
+::: {.column-screen}
-::: {.grid}
+::: {.hero-banner}
-::: {.g-col-8}
-# Taija
+::: {.hero-banner-carousel}
+{{< include content/news/news.qmd >}}
+:::
-### Trustworthy Artificial Intelligence in Julia
+::: {.hero-banner-text}
+# Make sense of your AI models
-Taija is the organization that hosts software geared towards **T**rustworthy **A**rtificial **I**ntelligence in **J**uli**a**.
-:::
+Artificial Intelligence (AI) has been advancing rapidly in recent years. Consequently, Julia's AI ecosystem has also been growing fast. Taija is an effort to provide users with tools to make sense of the AI models that they train and deploy. Some highlights include:
-::: {.g-col-4}
-{{< include content/news/news.qmd >}}
-:::
+- Model Explainability ([CounterfactualExplanations.jl](https://github.com/JuliaTrustworthyAI/CounterfactualExplanations.jl))
+- Algorithmic Recourse ([CounterfactualExplanations.jl](https://github.com/JuliaTrustworthyAI/CounterfactualExplanations.jl), [AlgorithmicRecourseDynamics.jl](https://github.com/JuliaTrustworthyAI/AlgorithmicRecourseDynamics.jl))
+- Predictive Uncertainty Quantification ([ConformalPrediction.jl](https://github.com/JuliaTrustworthyAI/ConformalPrediction.jl), [LaplaceRedux.jl](https://github.com/JuliaTrustworthyAI/LaplaceRedux.jl))
+- Effortless Bayesian Deep Learning ([LaplaceRedux.jl](https://github.com/JuliaTrustworthyAI/LaplaceRedux.jl))
+- Hybrid Learning ([JointEnergyModels.jl](https://github.com/JuliaTrustworthyAI/JointEnergyModels.jl))
+Taija is a community effort largely maintained by academics and students at TU Delft. We welcome contributions of any kind.
:::
:::
diff --git a/profile/index.qmd b/profile/index.qmd
index 2b3b3e6..03551f4 100644
--- a/profile/index.qmd
+++ b/profile/index.qmd
@@ -4,6 +4,12 @@ page-layout: custom
css: index.css
---
+::: {.content-block}
+
+{{< include welcome.qmd >}}
+
+:::
+
{{< include hero.qmd >}}
::: {.content-block}
diff --git a/profile/theme-light.scss b/profile/theme-light.scss
index 865d984..bb613f4 100644
--- a/profile/theme-light.scss
+++ b/profile/theme-light.scss
@@ -11,23 +11,49 @@ $navbar-bg: lighten($primary, 45%);
// Footer
$footer-bg: lighten($primary, 45%);
+// Welcome
+.welcome h1 {
+ color: #4063D8;
+ font-size: 3.5rem;
+}
+
+.welcome h2 {
+ border-bottom: 0cm;
+ margin-top: 0%;
+}
+
// Hero banner
.hero-banner {
position: relative;
background-color: lighten($primary, 40%);
display: flex;
justify-content: center;
+ padding-left: 30px;
+ padding-right: 30px;
+ // padding-top: 10px;
+ flex-wrap: wrap;
}
-
-.hero-banner h1 {
- color: #4063D8;
- font-size: 3.5rem;
+
+.hero-banner-text {
+ flex: 1; /* Each item will take equal space */
+ min-width: 300px; /* Minimum width of each item */
+ margin: 10px;
+}
+
+.hero-banner-carousel {
+ flex: 1; /* Each item will take equal space */
+ min-width: 300px; /* Minimum width of each item */
+ margin: 10px;
}
// Carousel
+.carousel {
+ margin-top: 50px;
+}
+
.carousel img {
- width: 150px;
- height: 150px;
+ width: 300px;
+ height: 300px;
max-width: 70%;
margin-bottom: 110px;
background-color: lighten($primary, 45%);
diff --git a/profile/tmp.gif b/profile/tmp.gif
deleted file mode 100644
index 115caec..0000000
Binary files a/profile/tmp.gif and /dev/null differ
diff --git a/profile/welcome.qmd b/profile/welcome.qmd
new file mode 100644
index 0000000..04ca754
--- /dev/null
+++ b/profile/welcome.qmd
@@ -0,0 +1,9 @@
+::: {.welcome}
+
+# Welcome to Taija
+
+## Trustworthy Artificial Intelligence in Julia
+
+Taija is the organization that hosts software geared towards **T**rustworthy **A**rtificial **I**ntelligence in **J**uli**a**.
+
+:::
\ No newline at end of file