Commit 7c9e78b (feat: add acquisition functions, #217)
1 parent: 4155120
Showing 10 changed files with 1,941 additions and 1 deletion.
{
  "label": "Active Learning",
  "position": 7,
  "collapsible": true
}
# Acquisition Functions

We want you to select the data samples that will be most informative to your model, so a natural approach is to score each sample based on its predicted usefulness for training.
Since labeling is usually done in batches, you can take the top _k_ scoring samples for annotation.
Such a function, which takes an unlabeled data sample and outputs its score, is called an _acquisition function_.
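The scoring-and-selection step above can be sketched in a few lines; `select_batch` and the example `scores` array are hypothetical names for illustration, assuming higher scores mean more informative samples.

```python
import numpy as np

def select_batch(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring unlabeled samples."""
    return np.argsort(scores)[::-1][:k]

# Hypothetical acquisition scores for six unlabeled samples.
scores = np.array([0.10, 0.80, 0.35, 0.95, 0.20, 0.60])
print(select_batch(scores, k=3))  # → [3 1 5]
```

The selected indices would then be sent to your annotators as the next labeling batch.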

## Uncertainty-based acquisition functions

In **Encord Active**, we employ the _uncertainty sampling_ strategy, scoring data samples by the uncertainty of the model's predictions.
The assumption is that samples the model is uncertain about are likely to be more informative than samples for which the model is confident about the label.

We include the following uncertainty-based acquisition functions:
* Least confidence: $U(x) = 1 - P_\theta(\hat{y}|x)$, where $\hat{y} = \underset{y \in \mathcal{Y}}{\arg\max}\, P_\theta(y|x)$
* Margin: $U(x) = P_\theta(\hat{y}_1|x) - P_\theta(\hat{y}_2|x)$, where $\hat{y}_1$ and $\hat{y}_2$ are the labels with the highest and second-highest predicted probabilities
* Variance: $U(x) = \mathrm{Var}(P_\theta(y|x)) = \frac{1}{|\mathcal{Y}|} \underset{y \in \mathcal{Y}}{\sum} (P_\theta(y|x) - \mu)^2$, where $\mu = \frac{1}{|\mathcal{Y}|} \underset{y \in \mathcal{Y}}{\sum} P_\theta(y|x)$
* Entropy: $U(x) = \mathcal{H}(P_\theta(y|x)) = -\underset{y \in \mathcal{Y}}{\sum} P_\theta(y|x) \log P_\theta(y|x)$

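The four measures above translate directly into NumPy; this is a minimal sketch, where `p` is a hypothetical softmax output over the label set.

```python
import numpy as np

def least_confidence(p: np.ndarray) -> float:
    return 1.0 - float(p.max())

def margin(p: np.ndarray) -> float:
    # Difference between the two highest probabilities;
    # a *low* margin indicates an uncertain sample.
    top2 = np.sort(p)[::-1][:2]
    return float(top2[0] - top2[1])

def variance(p: np.ndarray) -> float:
    return float(np.mean((p - p.mean()) ** 2))

def entropy(p: np.ndarray) -> float:
    eps = 1e-12  # guard against log(0)
    return float(-np.sum(p * np.log(p + eps)))

# A fairly uncertain prediction over four classes.
p = np.array([0.40, 0.30, 0.20, 0.10])
print(least_confidence(p))  # 0.6
```

For a confident prediction such as `[0.97, 0.01, 0.01, 0.01]`, all four measures would come out much closer to their "certain" extremes.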
:::caution
In the following scenarios, uncertainty-based acquisition functions must be used with extra care:
* Softmax outputs from deep networks are often not calibrated and tend to be quite overconfident.
* For convolutional neural networks, small, seemingly meaningless perturbations in the input space can completely change predictions.
:::
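The overconfidence caveat can be seen with a toy example; the logits below are hypothetical, and temperature scaling (dividing logits by a temperature $T > 1$ fitted on a held-out validation set) is one common post-hoc calibration technique.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5])    # hypothetical, overconfident network output
print(softmax(logits).max())          # ≈ 0.93: looks very confident

T = 2.0                               # temperature; normally fitted on validation data
print(softmax(logits / T).max())      # ≈ 0.72: softened after scaling
```

Without calibration, the "uncertainty" an acquisition function reads off these probabilities can be misleadingly low.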
## Which acquisition function should I use?

_“Ok, I have this list of acquisition functions now, but which one is the best? How do I choose?”_

This isn’t an easy question to answer; it depends heavily on your problem, your data, your model, your labeling budget, your goals, and so on.
The choice can be crucial to your results, and comparing multiple acquisition functions during the active learning process is not always feasible.

There is no single answer we can give you, but simple uncertainty measures like the least confidence score, margin score, and entropy are good first choices.

:::tip
If you’d like to talk to an expert on the topic, the Encord ML team can be found in the #general channel in our Encord Active [Slack workspace](https://join.slack.com/t/encordactive/shared_invite/zt-1hc2vqur9-Fzj1EEAHoqu91sZ0CX0A7Q).
:::
# Getting Started

To get started with using Encord Active for active learning, you should choose:
1. an Encord Active project,
2. a machine learning model, and
3. an acquisition function.

You also need to take into account some basics of **dataset initialization** and **model selection** while you make these choices.
If you already have these principles covered, you can advance directly to #todo.

## Dataset initialization

In the active learning paradigm, your model selects the examples to be labeled. To make these selections, however, you need a model from which you can get useful representations or uncertainty metrics, that is, a model that already “knows” something about the data.

This is typically accomplished by training an initial model on a random subset of the training data. Use just enough data to get a model that makes the acquisition function useful enough to kickstart the active learning process.

Also, **transfer learning** with pre-trained models can further reduce the required size of the seed dataset and accelerate the whole process.

:::tip
We recommend that you initially separate (not literally) your project data into training, test, and validation sets. The test and validation sets still need to be selected randomly and annotated in order to obtain unbiased performance estimates.
:::
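A minimal sketch of this initialization, with purely illustrative split sizes: random validation and test sets are held out and annotated up front, a small random seed set trains the initial model, and everything else remains in the unlabeled pool.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 10_000                 # size of the project data (hypothetical)
indices = rng.permutation(n_samples)

# Held-out sets for unbiased performance estimates, plus a small
# random seed set to train the initial model.
val_idx  = indices[:1_000]
test_idx = indices[1_000:2_000]
seed_idx = indices[2_000:2_500]
pool_idx = indices[2_500:]         # unlabeled pool for active learning

print(len(seed_idx), len(pool_idx))  # 500 7500
```

Because `indices` is a single permutation, the four index sets are disjoint by construction.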

## Model selection

Selecting a model for active learning is not a straightforward task.

It is often better done with domain knowledge than by validating models on data, for example by searching over architectures and hyperparameters using the initial seed training set.
Models that perform best in this limited-data setting are unlikely to be the best performers once you have labeled 10x as many examples, so you should avoid using them to select your data.

Instead, you should select data that optimizes the performance of your final model, which means using the type of model you expect to perform best on your task in general.

## Acquisition function selection

## Plug the model into an acquisition metric

## What's next?

The active learning loop also needs a stopping criterion, for example an exhausted labeling budget or a plateau in validation performance.
# Active Learning

The annotation process can be extremely time-consuming and expensive.
Images and videos can often be scraped or even captured automatically; however, labeling them for tasks like segmentation and motion detection is laborious.
Some domains, such as medical imaging, require domain knowledge from experts with limited availability.

When unlabeled data is abundant, wouldn’t it be nice if you could pick out the 5% of samples most useful to your model, rather than labeling large swathes of redundant data points?
This is the idea behind active learning.

**Encord Active** provides you with the tools to take advantage of the active learning method, and it's integrated with **Encord Annotate** to deliver the best annotation experience.

If you are already familiar with the foundations of active learning, continue with an exploration of **Encord Active**'s acquisition functions and common workflows.

import DocCardList from "@theme/DocCardList";

<DocCardList />
## What is active learning? | ||
|
||
Active learning is an iterative process where a [machine learning model](https://encord.com/blog/introduction-to-building-your-first-machine-learning) is used to select the best examples to be labeled next. | ||
After annotation, the model is retrained on the new, larger dataset, then selects more data to be labeled until reaching a stopping criterion. | ||
This process is illustrated in the figure below. | ||
|
||
![active-learning-cycle.svg](../images/active-learning/active-learning-cycle.svg) | ||
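The cycle in the figure can be sketched end to end on toy data. Everything here is illustrative: a nearest-centroid stand-in for the model, least confidence as the acquisition function, simply revealing the true labels in place of human annotation, and a fixed number of rounds as the stopping criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(xs: np.ndarray, ys: np.ndarray) -> dict:
    """Toy 'model': per-class centroids (nearest-centroid classifier)."""
    return {c: xs[ys == c].mean(axis=0) for c in np.unique(ys)}

def predict_proba(model: dict, x: np.ndarray) -> np.ndarray:
    """Softmax over negative distances to the class centroids."""
    d = np.array([np.linalg.norm(x - mu) for mu in model.values()])
    e = np.exp(-d)
    return e / e.sum()

def least_confidence(model: dict, xs: np.ndarray) -> np.ndarray:
    return 1.0 - np.array([predict_proba(model, x).max() for x in xs])

# Toy data: two Gaussian blobs, 100 points per class.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

labeled = list(rng.choice(200, 10, replace=False))   # random seed set
pool = [i for i in range(200) if i not in labeled]   # unlabeled pool

for _ in range(3):                                   # fixed budget as stopping criterion
    model = train(X[labeled], y[labeled])            # (re)train on labeled data
    scores = least_confidence(model, X[pool])        # score the unlabeled pool
    batch = [pool[i] for i in np.argsort(scores)[::-1][:5]]  # top-5 most uncertain
    labeled += batch                                 # "annotate" by revealing labels
    pool = [i for i in pool if i not in batch]       # remove them from the pool
```

Each pass through the loop mirrors one turn of the cycle: train, acquire, annotate, retrain.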

Check out our [practical guide to active learning for computer vision](https://encord.com/blog/a-practical-guide-to-active-learning-for-computer-vision/) to learn more about active learning, its tradeoffs and alternatives, and a comprehensive explanation of active learning pipelines.