| 1 | +# Python Reference Counting Model for Phlex Transforms |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Phlex's Python plugin bridges C++ and Python through `intptr_t` values |
| 6 | +that represent `PyObject*` pointers. These values flow through the |
| 7 | +framework's `declared_transform` nodes, which cache their outputs for |
| 8 | +reuse. This document describes the reference counting discipline |
| 9 | +required to prevent use-after-free and memory leaks. |
| 10 | + |
| 11 | +## Architecture |
| 12 | + |
| 13 | +A typical Python transform pipeline looks like this: |
| 14 | + |
| 15 | +```text |
| 16 | +Provider → [input converter] → [Python callback] → [output converter] → Observer/Fold |
| 17 | + (C++ → PyObject) (PyObject → PyObject) (PyObject → C++) |
| 18 | +``` |
| 19 | + |
| 20 | +Each `[…]` above is a `declared_transform` node. The framework caches |
| 21 | +each node's output in a `stores_` map keyed by `data_cell_index::hash()`. |
| 22 | +When multiple events share the same hash (e.g., all events within one |
| 23 | +job), the cached product store is reused without re-running the |
| 24 | +transform. |
| 25 | + |
| 26 | +## The Caching Problem |
| 27 | + |
| 28 | +The product store holds an `intptr_t` representing a `PyObject*`. This |
| 29 | +is an opaque integer to the framework — it has no C++ destructor and no |
| 30 | +way to call `Py_DECREF` on cleanup. This means: |
| 31 | + |
| 32 | +1. **The cached reference is never freed** by the framework. This is an |
| 33 | + accepted, bounded leak (one reference per unique hash per converter). |
| 34 | +2. **Consumers must not release the cached reference.** A `Py_DECREF` |
| 35 | + that drops the count of the cached `PyObject*` to zero frees the |
| 36 | + object and leaves a dangling pointer in the cache for later events. |
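The second point can be sketched with a mock reference-counted object in place of a real `PyObject`. Everything here (`MockObject`, `mock_decref`, `read_borrowed`, `read_and_decref`) is a hypothetical name introduced for illustration only, and the mock records a `freed` flag instead of releasing memory so the outcome can be inspected safely.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a PyObject: a count plus a flag that records
// when the count reached zero (a real object would be deallocated here).
struct MockObject {
  long refcnt = 1;
  bool freed = false;
  long value = 0;
};

inline void mock_decref(MockObject* o) {
  assert(o->refcnt > 0);  // releasing an already-freed object is a bug
  if (--o->refcnt == 0) {
    o->freed = true;
  }
}

// Correct consumer: treats the cached pointer as borrowed and only reads it.
inline long read_borrowed(std::intptr_t cached) {
  return reinterpret_cast<MockObject*>(cached)->value;
}

// Incorrect consumer: reads the value, then releases a reference it never owned.
inline long read_and_decref(std::intptr_t cached) {
  auto* o = reinterpret_cast<MockObject*>(cached);
  long v = o->value;
  mock_decref(o);  // takes the cache's only reference away
  return v;
}
```

After one call to `read_and_decref`, the mock is marked as freed even though the cache still holds its address, which is the situation the next event with the same hash would run into.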
| 37 | + |
| 38 | +## Rules |
| 39 | + |
| 40 | +### Rule 1: Input converters create new references |
| 41 | + |
| 42 | +Input converters (`_to_py` functions in `BASIC_CONVERTER` and |
| 43 | +`VECTOR_CONVERTER`) create a **new reference** (refcnt=1) that is |
| 44 | +stored in the product store cache. The cache owns this reference. |
| 45 | + |
| 46 | +```cpp |
| 47 | +// BASIC_CONVERTER: creates new reference via Python C API |
| 48 | +static intptr_t int_to_py(int a) { |
| 49 | + PyGILRAII gil; |
| 50 | + return (intptr_t)PyLong_FromLong(a); // new reference, refcnt=1 |
| 51 | +} |
| 52 | + |
| 53 | +// VECTOR_CONVERTER: creates new PhlexLifeline wrapping a numpy view |
| 54 | +static intptr_t vint_to_py(std::shared_ptr<std::vector<int>> const& v) { |
| 55 | + // ... creates PyArrayObject and PhlexLifeline ... |
| 56 | + return (intptr_t)pyll; // new reference, refcnt=1 |
| 57 | +} |
| 58 | +``` |
| 59 | + |
| 60 | +### Rule 2: py_callback XINCREF/XDECREF around the Python call |
| 61 | + |
| 62 | +`py_callback::call()` and `py_callback::callv()` receive `intptr_t` |
| 63 | +args that are **borrowed references** from the upstream product store |
| 64 | +cache. They must create temporary owned references for the duration of |
| 65 | +the Python function call: |
| 66 | + |
| 67 | +```cpp |
| 68 | +template <typename... Args> |
| 69 | +intptr_t call(Args... args) { |
| 70 | + PyGILRAII gil; |
| 71 | + |
| 72 | + // Create temporary owned references |
| 73 | + (Py_XINCREF((PyObject*)args), ...); |
| 74 | + |
| 75 | + PyObject* result = PyObject_CallFunctionObjArgs( |
| 76 | + m_callable, lifeline_transform(args)..., nullptr); |
| 77 | + |
| 78 | + // Release temporary references; cache references remain intact |
| 79 | + (Py_XDECREF((PyObject*)args), ...); |
| 80 | + |
| 81 | + return (intptr_t)result; // new reference, owned by output cache |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +The `Py_XINCREF`/`Py_XDECREF` pair ensures that even if the Python |
| 86 | +function or garbage collector decrements the object's reference count |
| 87 | +during the call, the cached reference remains valid. The `X` variants |
| 88 | +are null-safe, so they do nothing if an upstream converter returned |
| 89 | +null (for example, after an out-of-memory failure). |
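The bracketing pattern can be tried without a Python interpreter. The following is a minimal sketch that uses a mock counter instead of the real C API. `Counted`, `mock_xincref`, `mock_xdecref`, and `call_with_temporary_refs` are hypothetical names, and the callable stands in for the Python function. It needs C++17 for the fold expressions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a reference-counted Python object.
struct Counted {
  long refcnt = 1;
};

// Null-safe helpers that mirror the behaviour of the X variants.
inline void mock_xincref(Counted* o) {
  if (o) ++o->refcnt;
}
inline void mock_xdecref(Counted* o) {
  if (o) {
    assert(o->refcnt > 0);
    --o->refcnt;
  }
}

// Takes intptr_t arguments borrowed from a cache, holds a temporary
// reference on each one while fn runs, then releases the temporaries.
template <typename F, typename... Args>
long call_with_temporary_refs(F&& fn, Args... args) {
  (mock_xincref(reinterpret_cast<Counted*>(args)), ...);
  long result = fn(reinterpret_cast<Counted*>(args)...);
  (mock_xdecref(reinterpret_cast<Counted*>(args)), ...);
  return result;
}
```

Inside `fn` each argument is seen with a count of 2, and once the call returns each count is back at 1, so the reference owned by the cache is untouched.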
| 90 | + |
| 91 | +### Rule 3: Output converters must NOT Py_DECREF their input |
| 92 | + |
| 93 | +Output converters (`py_to_*` functions in `BASIC_CONVERTER` and |
| 94 | +`NUMPY_ARRAY_CONVERTER`) receive **borrowed references** from the |
| 95 | +upstream product store cache. They must not call `Py_DECREF` on the |
| 96 | +input: |
| 97 | + |
| 98 | +```cpp |
| 99 | +// BASIC_CONVERTER py_to_*: extracts C++ value, does NOT decref |
| 100 | +static int py_to_int(intptr_t pyobj) { |
| 101 | + PyGILRAII gil; |
| 102 | + int i = (int)PyLong_AsLong((PyObject*)pyobj); |
| 103 | + // NO Py_DECREF — input is borrowed from cache |
| 104 | + return i; |
| 105 | +} |
| 106 | + |
| 107 | +// NUMPY_ARRAY_CONVERTER py_to_*: copies array data, does NOT decref |
| 108 | +static std::shared_ptr<std::vector<int>> py_to_vint(intptr_t pyobj) { |
| 109 | + PyGILRAII gil; |
| 110 | + auto vec = std::make_shared<std::vector<int>>(); |
| 111 | + // ... copy data from PyArray or PyList ... |
| 112 | + // NO Py_DECREF — input is borrowed from cache |
| 113 | + return vec; |
| 114 | +} |
| 115 | +``` |
| 116 | + |
| 117 | +### Rule 4: lifeline_transform returns a borrowed reference |
| 118 | + |
| 119 | +`lifeline_transform()` unwraps `PhlexLifeline` objects to extract the |
| 120 | +numpy array view. It returns a borrowed reference in both cases: |
| 121 | + |
| 122 | +- If the arg is a `PhlexLifeline`, it returns `m_view` (a borrowed |
| 123 | + reference from the lifeline object, which stays alive because the |
| 124 | + caller holds a temporary INCREF on it per Rule 2). |
| 125 | +- If the arg is a plain `PyObject`, it returns the arg itself (a |
| 126 | + borrowed reference from the product store cache, protected by the |
| 127 | + INCREF per Rule 2). |
| 128 | + |
| 129 | +`lifeline_transform()` is applied to every argument in the same way in |
| 130 | +both `call()` and `callv()`. |
| 131 | + |
| 132 | +### Rule 5: VECTOR_CONVERTER must throw on error, never return null |
| 133 | + |
| 134 | +`VECTOR_CONVERTER` error paths must throw `std::runtime_error` instead |
| 135 | +of returning `(intptr_t)nullptr`. A null `intptr_t` passed to |
| 136 | +`PyObject_CallFunctionObjArgs` acts as the argument-list sentinel, |
| 137 | +silently truncating the argument list and causing the Python function to |
| 138 | +receive fewer arguments than expected. |
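The truncation effect follows from how any null-terminated C varargs list is walked. As a minimal sketch, the hypothetical helper `count_until_null` below mimics that walk. It is not part of Phlex or the Python C API and only illustrates the mechanism.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>

// Hypothetical helper: counts arguments the way a null-terminated C varargs
// API walks its list, stopping at the first null pointer it meets.
inline std::size_t count_until_null(void* first, ...) {
  std::size_t n = 0;
  va_list ap;
  va_start(ap, first);
  for (void* p = first; p != nullptr; p = va_arg(ap, void*)) {
    ++n;
  }
  va_end(ap);
  return n;
}

// Usage: a null in the middle hides every later argument.
inline void demo_truncation() {
  int x = 1;
  int y = 2;
  void* px = &x;
  void* py = &y;
  void* none = nullptr;
  assert(count_until_null(px, py, none) == 2);        // both arguments seen
  assert(count_until_null(px, none, py, none) == 1);  // py is silently lost
}
```

A callee that walks its list this way cannot tell a failed conversion from the intended end of the list, which is why throwing is the safer error path.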
| 139 | + |
| 140 | +### Rule 6: declared_transform must erase stale cache entries |
| 141 | + |
| 142 | +`declared_transform::stores_.insert()` creates an entry with a null |
| 143 | +`product_store_ptr`. If the transform's `call()` throws before |
| 144 | +assigning `a->second`, the null entry persists in the cache. Subsequent |
| 145 | +events with the same hash hit the `else` branch and propagate the null |
| 146 | +product store downstream, causing segmentation faults when downstream |
| 147 | +converters dereference it. |
| 148 | + |
| 149 | +Fix: wrap the transform body in `try/catch` and erase the stale entry |
| 150 | +on exception: |
| 151 | + |
| 152 | +```cpp |
| 153 | +if (stores_.insert(a, hash)) { |
| 154 | + try { |
| 155 | + // ... compute and assign a->second ... |
| 156 | + } catch (...) { |
| 157 | + stores_.erase(a); |
| 158 | + throw; |
| 159 | + } |
| 160 | +} |
| 161 | +``` |
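The same erase-on-exception idea can be exercised in isolation. The sketch below assumes a plain `std::unordered_map` instead of the framework's own `stores_` container and a `std::shared_ptr<int>` in place of `product_store_ptr`. `store_ptr`, `store_cache`, and `get_or_compute` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <unordered_map>

using store_ptr = std::shared_ptr<int>;  // stand-in for product_store_ptr
using store_cache = std::unordered_map<std::size_t, store_ptr>;

// Returns the cached store for hash, computing it on first use. If compute
// throws, the null placeholder is erased so later lookups start fresh.
template <typename Compute>
store_ptr get_or_compute(store_cache& cache, std::size_t hash, Compute&& compute) {
  auto [it, inserted] = cache.try_emplace(hash);  // new entries start as null
  if (inserted) {
    try {
      it->second = compute();
    } catch (...) {
      cache.erase(it);  // drop the null placeholder before rethrowing
      throw;
    }
  }
  assert(it->second != nullptr);  // this sketch expects compute() to return non-null
  return it->second;
}
```

With this shape, a failed first attempt leaves no entry behind, so a later event with the same hash runs the computation again instead of reading a null store.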
| 162 | + |
| 163 | +## Reference Flow Diagram |
| 164 | + |
| 165 | +```text |
| 166 | + ┌──────────────┐ |
| 167 | + │ Provider │ C++ value (int, float, vector<T>) |
| 168 | + └──────┬───────┘ |
| 169 | + │ |
| 170 | + ┌──────▼───────┐ |
| 171 | + │ input conv. │ Creates NEW PyObject* reference (refcnt=1) |
| 172 | + │ (e.g. int_ │ Stored in product_store cache |
| 173 | + │ to_py) │ Cache OWNS this reference |
| 174 | + └──────┬───────┘ |
| 175 | + │ intptr_t (PyObject*, borrowed from cache) |
| 176 | + ┌──────▼───────┐ |
| 177 | + │ py_callback │ XINCREF args (refcnt: 1→2) |
| 178 | + │ ::call() │ Call Python function |
| 179 | + │ │ XDECREF args (refcnt: 2→1, cache ref intact) |
| 180 | + │ │ Return result (NEW reference, refcnt=1) |
| 181 | + └──────┬───────┘ |
| 182 | + │ intptr_t (PyObject*, borrowed from cache) |
| 183 | + ┌──────▼───────┐ |
| 184 | + │ output conv. │ Reads PyObject* value |
| 185 | + │ (e.g. py_to_ │ Does NOT Py_DECREF |
| 186 | + │ int) │ Returns C++ value |
| 187 | + └──────┬───────┘ |
| 188 | + │ |
| 189 | + ┌──────▼───────┐ |
| 190 | + │ Observer │ Uses C++ value |
| 191 | + └──────────────┘ |
| 192 | +``` |
| 193 | + |
| 194 | +## Why Small Integers Mask the Bug |
| 195 | + |
| 196 | +CPython preallocates small integers (-5 to 256) as shared singletons. In |
| 197 | +Python 3.12+, these are immortal, so their reference counts never |
| 198 | +change. An incorrect `Py_DECREF` on a cached integer does not free it, |
| 199 | +and the pointer in the product store cache still refers to a valid |
| 200 | +object. This is why tests using only small integers (like `py:types`) |
| 201 | +can pass even with incorrect reference counting. |
| 202 | + |
| 203 | +Tests using floats (`py:coverage`), non-cached integers, or |
| 204 | +`PhlexLifeline` objects (`py:vectypes`, `py:veclists`) expose the bug |
| 205 | +because these objects have ordinary reference counts and are freed as |
| 206 | +soon as a `Py_DECREF` drops the count to zero. |