Consuming iterators/generator multiple times #56
Pasting this here, since it came up again. CD says this in his log:

> Focused my implementation on the use of generators and built-in functions. Problem: two funcNodes can't share the same generator.

I'm wondering if we've discussed this before (and hopefully recorded our deliberations and links to resources for it).

CD: Yes, we can use `tee`.

Another trick you can use, CA, is to not use the generator as your input, but the generator (factory) function. That way, each component that needs the generator creates one for itself.

But of course, we want to make it easy for an unskilled user to take care of this without knowing the recipe. Therefore we need to figure out a tool for this problem that is scalable.

One proposal is to have something like `tee`, but where we have control over where and how the "caching" (because that's what this is, again) is done. For instance, in a persisted store. But it could also be (the mapping interface of stores offers this possibility as well) instructions on how to recompute the generator.
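A minimal sketch of the generator-factory trick described above (names and data are made up for illustration):

```python
from typing import Iterator

def audio_chunks() -> Iterator[str]:
    """A generator function: each call returns a fresh, independent generator."""
    for i in range(3):
        yield f"chunk{i}"  # stand-in for real waveform chunks

# Pass the generator *function* around, not a generator instance:
consumer_a = audio_chunks()
consumer_b = audio_chunks()
assert list(consumer_a) == ["chunk0", "chunk1", "chunk2"]
assert list(consumer_b) == ["chunk0", "chunk1", "chunk2"]  # not exhausted by consumer_a
```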
Note: Some discussion (with code) in the "Dealing with iterables in DAGs" section of the meshed - ideas notebook.
Problem: Consuming an iterator multiple times
Consider the situation where the output of a function is a generator that loops through many audio files and creates a stream of waveform "chunks". If more than one function consumes that iterator, we have a problem. In a serial execution of a DAG, say, one of the functions will consume the iterator entirely, and then the second will find itself with an empty iterator. Parallel execution would arguably be worse, since each consumer would get only an (unpredictable) subset of the chunks.
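A toy illustration of the failure mode (illustrative code, not meshed's):

```python
# A single generator instance shared by two consumers:
chunks = (f"chunk{i}" for i in range(3))

def count_chunks(it):
    return sum(1 for _ in it)

def first_chunk(it):
    return next(it, None)

assert count_chunks(chunks) == 3     # first consumer exhausts the iterator
assert first_chunk(chunks) is None   # second consumer gets nothing
```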
Solutions
tee

The builtin solution for this is `itertools.tee`, but the way `tee` is implemented has many problems in our context. The teed iterators would require a lot of memory if they are not all consumed at a similar pace. Also, they're not thread-safe. Though `tee` may be the right solution in some situations, we'll need another solution handy for when it's not.

dol.tee

This doesn't exist (yet), but such a solution would be a version of `tee` (see code below) that uses persisting deques. Or rather, deques that source themselves from a persisted store that is fed by the original iterator.

We could enable the storage to be flushed once it is not needed anymore (like garbage collection), and use temp storage anyway, in case the flush doesn't happen for some reason.

Still, the more ideal situation for this solution would be if we're caching the iterator's values anyway, so there's no real waste.
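Below is a rough sketch of what such a store-backed `tee` could look like. It's a hedged illustration, not dol's API: `store_tee` is a hypothetical name, the implementation is simplified (integer keys, a coarse lock, no flushing), and any `MutableMapping` stands in for the persisted store.

```python
from threading import Lock
from typing import Iterator, MutableMapping, Optional

def store_tee(iterator: Iterator, n: int = 2,
              store: Optional[MutableMapping] = None):
    """Split `iterator` into `n` independent iterators, buffering via `store`.

    `store` can be any MutableMapping: a plain dict for in-memory buffering,
    or a persisted (e.g. disk-backed) store, so slow consumers replay from
    storage instead of the buffer being held in RAM.
    """
    store = {} if store is None else store
    lock = Lock()   # guards the shared source iterator
    n_pulled = 0    # how many items have been pulled from the source

    def ensure_cached(i):
        # Make sure item i is in the store, pulling from the source if needed.
        nonlocal n_pulled
        with lock:
            while n_pulled <= i:
                store[n_pulled] = next(iterator)  # raises StopIteration at end
                n_pulled += 1

    def branch() -> Iterator:
        i = 0
        while True:
            try:
                ensure_cached(i)
            except StopIteration:
                return
            yield store[i]
            i += 1

    return tuple(branch() for _ in range(n))

a, b = store_tee(iter(range(4)), n=2, store={})
assert list(a) == [0, 1, 2, 3]
assert list(b) == [0, 1, 2, 3]  # replayed from the store, not from the source
```

With `store={}` this behaves like an in-memory `tee`; with a disk-backed mapping (e.g. a dol store, after adapting the integer keys to strings), the buffering moves out of RAM, and "flushing" amounts to deleting keys that all branches have passed.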
callback/factory

Another solution would go back to the source of the iterator and have it regenerate the iterator, when this is possible.

One method to do this, in a DAG, say, is that when the DAG is "compiled" (validation, topological ordering, etc.) and multiple consumption of an iterator is detected, the function producing the iterator is "command-i-fied" (that is, we remember the function and its arguments, but don't execute right away), and it is only executed when the value (the iterator) of the function (e.g. the generator) is needed. That way, we generate a new "copy" of the iterator for each function needing it.
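A minimal sketch of the "command-i-fication" idea (the `Command` class is illustrative, not an existing meshed/i2 object):

```python
class Command:
    """Remember a function and its arguments; execute only when called."""
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __call__(self):
        return self.func(*self.args, **self.kwargs)

def audio_chunks(paths):
    for path in paths:
        for i in range(2):
            yield f"{path}:chunk{i}"  # stand-in for real chunks

# Store the command that makes the iterator, not the iterator itself:
make_chunks = Command(audio_chunks, ["a.wav", "b.wav"])

# Each consumer executes the command to get its own fresh iterator:
assert list(make_chunks()) == list(make_chunks())
```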
This can be done at the level of the scope `MutableMapping` itself. For example, we register the `'i_am_iterator'` var node so that a second read of `scope['i_am_iterator']` doesn't give the stored value (which might be an exhausted iterator), but calls the original producer again to make a new iterator.
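Here's a hedged sketch of that behavior, assuming a `MutableMapping`-based scope (`RefreshingScope` and its producer registry are hypothetical, not meshed's actual scope):

```python
from collections.abc import MutableMapping

class RefreshingScope(MutableMapping):
    """A dict-like scope where registered keys recompute their value on every read."""
    def __init__(self, producers=None):
        self._data = {}
        self._producers = producers or {}  # key -> zero-arg factory

    def __getitem__(self, key):
        if key in self._producers:
            return self._producers[key]()  # a fresh iterator on every read
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data.keys() | self._producers.keys())

    def __len__(self):
        return len(self._data.keys() | self._producers.keys())

scope = RefreshingScope(producers={'i_am_iterator': lambda: iter(range(3))})
assert list(scope['i_am_iterator']) == [0, 1, 2]
assert list(scope['i_am_iterator']) == [0, 1, 2]  # second read is fresh too
```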
How does the DAG detect multiple consumption? Could be that a var node is connected to more than one func node and the value is known to be an iterator, either:

- declared explicitly (e.g. through an `Iterize` class, or by specifying this information in a var_node info dict), or
- inferred from the type (`Generator` or `Iterator`, e.g. a `typing.Generator` annotation).

What I don't like about this solution is that it is not local, like the first two. In meshed, we have our contextual objects (DAG, etc.), but still, I favor local solutions since they have better SoC (separation of concerns). The solution's complexity is carried by the object that carries the problem in the first place -- we don't need to extend (and therefore complexify) the DAG to handle those cases if and when they happen.