
Conversation

MohsenTaheriShalmani
Contributor

Since the main objective is to detect HTTP 400 responses, precision is a more meaningful metric than accuracy. Accuracy can be misleading if 400s are rare, while precision reflects how reliable the model’s 400 predictions are.
Additionally, it may be more sensible to use precision as the criterion in model verification, rather than expecting perfect performance.
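To illustrate with hypothetical numbers (not measured in this PR): if only 5 out of 100 responses are 400s, a model that never predicts 400 still scores 95% accuracy while being useless for our goal.

// Hypothetical confusion counts: 100 calls, 5 actual 400s, model never predicts 400.
val tp = 0   // predicted 400, actual 400
val fp = 0   // predicted 400, actual not 400
val tn = 95  // predicted not 400, actual not 400
val fn = 5   // predicted not 400, actual 400

val accuracy = (tp + tn).toDouble() / (tp + tn + fp + fn)            // 0.95, looks great
val precision = if (tp + fp > 0) tp.toDouble() / (tp + fp) else 0.0  // 0.0, never finds a 400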

Base automatically changed from Packing-Probabilistic-AI-Models to master September 22, 2025 07:17
/**
 * Estimate the precision of the model
 */
fun estimatePrecision400(): Double
Collaborator

If we provide both measures, we need to expand the comments to clarify the difference between accuracy and precision here.
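For example, the KDoc could spell out both measures explicitly; this is just a suggested wording, not final text:

/**
 * Estimate the precision of the model on HTTP 400 predictions,
 * i.e., TP / (TP + FP): of all the times the model predicted a 400,
 * the fraction that actually were 400 responses.
 * This differs from accuracy, which is the fraction of all
 * predictions (for any status code) that were correct.
 */
fun estimatePrecision400(): Double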

* If a new input is created based on a rejected prediction, then we would not know whether it was correct or not.
*/
fun updatePerformance(predictionWasCorrect: Boolean)
fun updatePerformance(predictedStatusCode: Int, actualStatusCode: Int? = null)
Collaborator

I assume the semantics here are predictionWasCorrect = (predictedStatusCode == actualStatusCode)? If so, how can actualStatusCode be nullable?

val tp = truePositive400Queue.count { it } // count only true entries
val fp = falsePositive400Queue.count { it } // count only true entries
return if ((tp + fp) > 0) tp.toDouble() / (tp + fp) else 0.0
}
Collaborator

We need some basic unit tests to validate these computations (e.g., in ModelAccuracyWithTimeWindowTest.kt).
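Something along these lines, for example, assuming the updatePerformance(predictedStatusCode, actualStatusCode) overload introduced in this PR:

@Test
fun testEstimatePrecision400() {
    val delta = 0.0001
    val ma = ModelMetricsWithTimeWindow(10)

    // One true positive: predicted 400, actual 400.
    ma.updatePerformance(predictedStatusCode = 400, actualStatusCode = 400)
    // One false positive: predicted 400, actual 200.
    ma.updatePerformance(predictedStatusCode = 400, actualStatusCode = 200)

    // precision = TP / (TP + FP) = 1 / 2
    assertEquals(0.5, ma.estimatePrecision400(), delta)
}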

truePositive400Queue.add(predictionWasCorrect)
} else {
falsePositive400Queue.add(!predictionWasCorrect)
}
Collaborator

Is this correct? At each call of updatePerformance, should both data structures truePositive400Queue and falsePositive400Queue be updated? Regardless, add unit tests to validate the precision calculations.

Contributor Author

Nice catch. I'll fix it.


ma.updatePerformance(true)
ma.updatePerformance(predictedStatusCode = 200, actualStatusCode = 200) //true
assertEquals(1.0, ma.estimateAccuracy(), delta)
Collaborator

I guess here you could add assertions on the estimated precision.
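For instance, since no 400 is ever predicted in this test, precision should fall back to its no-data value (0.0 in the current implementation):

assertEquals(0.0, ma.estimatePrecision400(), delta)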

@arcuri82
Collaborator

@MohsenTaheriShalmani is this ready for me to review, or are you still making changes?
Btw, recall that, if you are waiting for my review and that is blocking you from working further, you can make new changes in a branch of the branch while I am reviewing the first one :)

override fun estimateMetrics(endpoint: Endpoint): ModelEvaluation {
    val m = models[endpoint] ?: return ModelEvaluation(
        accuracy = 0.5,
        precision400 = 0.5,
Collaborator

Why 0.5 as default values, and not 0.0?

override fun estimateOverallAccuracy(): Double {
override fun estimateOverallMetrics(): ModelEvaluation {
if (models.isEmpty()) {
return ModelEvaluation(
Collaborator

Duplicated code. We should have a default immutable instance, for example declared in and returned directly from ModelEvaluation, e.g., something like ModelEvaluation.DEFAULT_NO_DATA.
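A possible sketch, assuming ModelEvaluation is (or can be turned into) a data class with exactly these fields:

data class ModelEvaluation(
    val accuracy: Double,
    val precision400: Double,
    val recall400: Double,
    val f1Score400: Double,
    val mcc: Double
) {
    companion object {
        // Single shared instance to return whenever there is no data to
        // estimate from; changing a default here changes it everywhere.
        val DEFAULT_NO_DATA = ModelEvaluation(
            accuracy = 0.5,
            precision400 = 0.5,
            recall400 = 0.0,
            f1Score400 = 0.0,
            mcc = 0.0
        )
    }
}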

val n = models.size.toDouble()
val sum = models.values.sumOf { it.estimateOverallAccuracy() }
val total = models.values.map {
it?.estimateOverallMetrics() ?: ModelEvaluation(0.5, 0.5, 0.0, 0.0, 0.0)
Collaborator

Duplicated code.

precision400 = 0.5,
recall400 = 0.0,
f1Score400 = 0.0,
mcc = 0.0
Collaborator

Duplicated code.

val acc = modelMetrics.estimateAccuracy()
val prec = modelMetrics.estimatePrecision400()
val rec = modelMetrics.estimateRecall400()
val f1 = if (prec + rec == 0.0) 0.0 else 2 * (prec * rec) / (prec + rec)
Collaborator

Is f1 derived from prec and rec? If so, why have it as an input, rather than computing it directly inside ModelMetrics?

override fun estimateF1Score400(): Double {
val p = estimatePrecision400()
val r = estimateRecall400()
return if (p + r > 0.0) 2 * (p * r) / (p + r) else 0.0
Collaborator

Duplicated equation.

* Unified metrics estimate packaged into a [ModelEvaluation].
*/
override fun estimateMetrics(): ModelEvaluation {
return ModelEvaluation(
Collaborator

This could be the default implementation directly in the interface ModelEvaluation.
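A sketch of the idea, assuming the estimate functions are declared in an interface (here called ModelMetrics, which may not match the actual name): both F1 and the packaging into ModelEvaluation can be default methods, so each equation lives in exactly one place.

interface ModelMetrics {
    fun estimateAccuracy(): Double
    fun estimatePrecision400(): Double
    fun estimateRecall400(): Double
    fun estimateMCC400(): Double

    // F1 is fully derived from precision and recall, so compute it here once.
    fun estimateF1Score400(): Double {
        val p = estimatePrecision400()
        val r = estimateRecall400()
        return if (p + r > 0.0) 2 * (p * r) / (p + r) else 0.0
    }

    // Default packaging; implementations no longer need to override this.
    fun estimateMetrics() = ModelEvaluation(
        accuracy = estimateAccuracy(),
        precision400 = estimatePrecision400(),
        recall400 = estimateRecall400(),
        f1Score400 = estimateF1Score400(),
        mcc = estimateMCC400()
    )
}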

actualStatusCode == 400 && predictedStatusCode != 400 -> falseNegative400Queue.add(true)
actualStatusCode != 400 && predictedStatusCode == 400 -> falsePositive400Queue.add(true)
actualStatusCode != 400 && predictedStatusCode != 400 -> trueNegative400Queue.add(true)
}
Collaborator

I'm not sure these computations are correct, if we want to have a time window of size N.
Regardless of the number of queues, only half of them get new data at each call of this function, and only with value true.
I think all queues should be updated (e.g., consider the false cases as well); see the sketch below.
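A possible fix sketch, assuming each queue is a bounded buffer over the last N calls: push one entry into every queue on each call, so all four always describe the same time window.

fun updatePerformance(predictedStatusCode: Int, actualStatusCode: Int) {
    val predicted400 = predictedStatusCode == 400
    val actual400 = actualStatusCode == 400
    // Every queue gets exactly one new entry per call, true or false,
    // so the four queues stay aligned on the same window of size N.
    truePositive400Queue.add(predicted400 && actual400)
    falsePositive400Queue.add(predicted400 && !actual400)
    falseNegative400Queue.add(!predicted400 && actual400)
    trueNegative400Queue.add(!predicted400 && !actual400)
}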

override fun estimateAccuracy(endpoint: Endpoint): Double = 0.0
override fun estimateOverallAccuracy(): Double = 0.0
override fun estimateMetrics(endpoint: Endpoint): ModelEvaluation = ModelEvaluation(accuracy = 0.5, precision400 = 0.5, recall400 = 0.0, f1Score400 = 0.0, mcc = 0.0)
override fun estimateOverallMetrics(): ModelEvaluation = ModelEvaluation(accuracy = 0.5, precision400 = 0.5, recall400 = 0.0, f1Score400 = 0.0, mcc = 0.0)
Collaborator

Duplicated code. This default instance should really be instantiated only once and referenced in all these different places, so that, if we want to change the default of one parameter, we do it only once, and not 20 times :)
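With a shared constant (see the ModelEvaluation.DEFAULT_NO_DATA sketch above), the two overrides would collapse to:

override fun estimateMetrics(endpoint: Endpoint): ModelEvaluation = ModelEvaluation.DEFAULT_NO_DATA
override fun estimateOverallMetrics(): ModelEvaluation = ModelEvaluation.DEFAULT_NO_DATA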

fun estimateModelMetrics() {

val delta = 0.0001
val ma = ModelMetricsWithTimeWindow(10) // buffer large enough to hold all
Collaborator

Besides this test, add back the previously deleted test, where val ma = ModelAccuracyWithTimeWindow(4).
That was done on purpose to verify the time window.
As a rule of thumb, do not delete or modify existing tests; add new ones (of course, if needed, changes/deletions can be discussed).
As written in a previous comment, I do not think the time-window implementation is correct. To verify it, you need a buffer that is smaller than the data you insert; see the sketch below.
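Not necessarily the deleted test verbatim, but a sketch of what a window-verifying test could look like, under the assumption that the window keeps only the last N updates:

@Test
fun testAccuracyTimeWindow() {
    val delta = 0.0001
    // Window of 4, smaller than the 6 data points inserted below.
    val ma = ModelAccuracyWithTimeWindow(4)

    // Two wrong predictions, which should later be evicted.
    ma.updatePerformance(false)
    ma.updatePerformance(false)
    // Four correct predictions fill the whole window.
    repeat(4) { ma.updatePerformance(true) }

    // If eviction works, only the last 4 entries count.
    assertEquals(1.0, ma.estimateAccuracy(), delta)
}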

}

return estimateOverallAccuracy()
return ModelEvaluation(
Collaborator

I don't understand this change. Why remove the call to estimateOverallAccuracy()?

Contributor Author

I think for an endpoint model, “overall” is the same as the endpoint’s metrics, so I just return its own metrics directly.

modelMetrics.estimatePrecision400(),
modelMetrics.estimateRecall400(),
modelMetrics.estimateMCC400()
)
Collaborator

Same as in the other case: a model for a specific endpoint has no difference between estimateMetrics(endpoint: Endpoint) and estimateOverallMetrics(), apart from the endpoint validation in the former, so one function should call the other to avoid code duplication. Also, what is the difference between:

return ModelEvaluation(
            modelMetrics.estimateAccuracy(),
            modelMetrics.estimatePrecision400(),
            modelMetrics.estimateRecall400(),
            modelMetrics.estimateMCC400()
        )

and

return modelMetrics.estimateMetrics()

?


/** Unified metrics estimate packaged into a [ModelEvaluation] */
override fun estimateMetrics(): ModelEvaluation {
return ModelEvaluation(
Collaborator

Why does this need to be overridden if the implementation is the same?

override fun estimateMetrics(): ModelEvaluation {
return ModelEvaluation(
accuracy = estimateAccuracy(),
precision400 = estimatePrecision400(),
Collaborator

See previous comment: if the implementation is the same, then there is no need to override.

val ma = ModelMetricsWithTimeWindow(10) // buffer large enough to hold all

// 1) correct prediction: 200 vs. 200 -> TN
ma.updatePerformance(predictedStatusCode = 200, actualStatusCode = 200)
Collaborator

Keep this test, but add a new one (you can copy & paste some parts of this test) in which the amount of data is more than the buffer. E.g., you could have a buffer of size 2, and insert the first 5 elements in this test; see the sketch below.
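E.g., something like this, under the assumption that only the last 2 updates should count once the buffer is full:

@Test
fun testMetricsWithSmallTimeWindow() {
    val delta = 0.0001
    val ma = ModelMetricsWithTimeWindow(2)

    // 3 wrong predictions, all of which should eventually be evicted...
    repeat(3) {
        ma.updatePerformance(predictedStatusCode = 400, actualStatusCode = 200)
    }
    // ...by these 2 correct ones, which fill the whole window.
    ma.updatePerformance(predictedStatusCode = 200, actualStatusCode = 200)
    ma.updatePerformance(predictedStatusCode = 200, actualStatusCode = 200)

    assertEquals(1.0, ma.estimateAccuracy(), delta)
}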
