Consuming iterators/generator multiple times #56
Pasting this here, since it came up again. CD says this in his log:

> Focused my implementation on the use of generators and built-in functions. Problem: two funcNodes can't share the same generator.

I'm wondering if we've discussed this before (and hopefully recorded our deliberations and links to resources for it).

CD: Yes, we can use `tee`.

Another trick you can use, CA, is to not use the generator as your input, but the generator (factory) function. That way, each component that needs the generator creates one for itself.

But of course, we want to make it easy for an unskilled user to take care of this without knowing the recipe. Therefore we need to figure out a tool for this problem that is scalable.

One proposal is to have something like `tee`, but where we have control over where and how the "caching" (because that's what this is, again) is done. For instance, in a persisted store. But it could also be (the mapping interface of stores offers this possibility as well) instructions on how to recompute the generator.
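A minimal sketch of the generator-factory trick described above (names and data are made up for illustration):

```python
from typing import Iterator

def audio_chunks() -> Iterator[str]:
    """A generator function: each call returns a fresh, independent generator."""
    for i in range(3):
        yield f"chunk{i}"  # stand-in for real waveform chunks

# Pass the generator *function* around, not a generator instance:
consumer_a = audio_chunks()
consumer_b = audio_chunks()
assert list(consumer_a) == ["chunk0", "chunk1", "chunk2"]
assert list(consumer_b) == ["chunk0", "chunk1", "chunk2"]  # not exhausted by consumer_a
```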
Note: Some discussion (with code) in the "Dealing with iterables in DAGs" section of the meshed - ideas notebook.
Problem: Consuming an iterator multiple times
Consider the situation where the output of a function is a generator that loops through many audio files and creates a stream of waveform "chunks". If more than one function consumes that iterator, we have a problem. In a serial execution of a DAG, say, one of the functions will consume the iterator entirely, and then the second will find itself with an empty iterator. Parallel execution would arguably be worse, since each consumer would get only an (unpredictable) subset of the chunks.
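A toy illustration of the failure mode (illustrative code, not meshed's):

```python
# A single generator instance shared by two consumers:
chunks = (f"chunk{i}" for i in range(3))

def count_chunks(it):
    return sum(1 for _ in it)

def first_chunk(it):
    return next(it, None)

assert count_chunks(chunks) == 3     # first consumer exhausts the iterator
assert first_chunk(chunks) is None   # second consumer gets nothing
```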
Solutions
tee

The builtin solution for this is `itertools.tee`, but the way `tee` is implemented has many problems in our context. The teed iterators would require a lot of memory if they are not all consumed at a similar pace. Also, they're not thread-safe. Though `tee` may be the right solution in some situations, we'll need another solution handy for when it's not.

dol.tee

This doesn't exist (yet), but such a solution would be a version of `tee` (see code below) that uses persisting deques. Or rather, deques that source themselves from a persisted store that is fed by the original iterator.

We could enable the storage to be flushed once it is not needed anymore (like garbage collection), and use temp storage anyway, in case the flush doesn't happen for some reason.

Still, the more ideal situation for this solution would be if we're caching the iterator's values anyway, so there's no real waste.
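Below is a rough sketch of what such a store-backed `tee` could look like. It's a hedged illustration, not dol's API: `store_tee` is a hypothetical name, the implementation is simplified (integer keys, a coarse lock, no flushing), and any `MutableMapping` stands in for the persisted store.

```python
from threading import Lock
from typing import Iterator, MutableMapping, Optional

def store_tee(iterator: Iterator, n: int = 2,
              store: Optional[MutableMapping] = None):
    """Split `iterator` into `n` independent iterators, buffering via `store`.

    `store` can be any MutableMapping: a plain dict for in-memory buffering,
    or a persisted (e.g. disk-backed) store, so slow consumers replay from
    storage instead of the buffer being held in RAM.
    """
    store = {} if store is None else store
    lock = Lock()   # guards the shared source iterator
    n_pulled = 0    # how many items have been pulled from the source

    def ensure_cached(i):
        # Make sure item i is in the store, pulling from the source if needed.
        nonlocal n_pulled
        with lock:
            while n_pulled <= i:
                store[n_pulled] = next(iterator)  # raises StopIteration at end
                n_pulled += 1

    def branch() -> Iterator:
        i = 0
        while True:
            try:
                ensure_cached(i)
            except StopIteration:
                return
            yield store[i]
            i += 1

    return tuple(branch() for _ in range(n))

a, b = store_tee(iter(range(4)), n=2, store={})
assert list(a) == [0, 1, 2, 3]
assert list(b) == [0, 1, 2, 3]  # replayed from the store, not from the source
```

With `store={}` this behaves like an in-memory `tee`; with a disk-backed mapping (e.g. a dol store, after adapting the integer keys to strings), the buffering moves out of RAM, and "flushing" amounts to deleting keys that all branches have passed.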
callback/factory

Another solution would go back to the source of the iterator and have it regenerate the iterator, when this is possible.

One method to do this, in a DAG, say, is that when the DAG is "compiled" (validation, topological ordering, etc.) and multiple consumption of an iterator is detected, the function producing the iterator is "command-i-fied" (that is, we remember the function and its arguments, but don't execute right away), and it is only executed when the value (the iterator) of the function (e.g. the generator) is needed. That way, we generate a new "copy" of the iterator for each function needing it.
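A minimal sketch of the "command-i-fication" idea (the `Command` class is illustrative, not an existing meshed/i2 object):

```python
class Command:
    """Remember a function and its arguments; execute only when called."""
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __call__(self):
        return self.func(*self.args, **self.kwargs)

def audio_chunks(paths):
    for path in paths:
        for i in range(2):
            yield f"{path}:chunk{i}"  # stand-in for real chunks

# Store the command that makes the iterator, not the iterator itself:
make_chunks = Command(audio_chunks, ["a.wav", "b.wav"])

# Each consumer executes the command to get its own fresh iterator:
assert list(make_chunks()) == list(make_chunks())
```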
This can be done at the level of the scope `MutableMapping` itself. For example, we register the `'i_am_iterator'` var node so that a second read of `scope['i_am_iterator']` doesn't give the stored value (which might be an exhausted iterator), but calls the original producer again to make a new iterator.
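Here's a hedged sketch of that behavior, assuming a `MutableMapping`-based scope (`RefreshingScope` and its producer registry are hypothetical, not meshed's actual scope):

```python
from collections.abc import MutableMapping

class RefreshingScope(MutableMapping):
    """A dict-like scope where registered keys recompute their value on every read."""
    def __init__(self, producers=None):
        self._data = {}
        self._producers = producers or {}  # key -> zero-arg factory

    def __getitem__(self, key):
        if key in self._producers:
            return self._producers[key]()  # a fresh iterator on every read
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data.keys() | self._producers.keys())

    def __len__(self):
        return len(self._data.keys() | self._producers.keys())

scope = RefreshingScope(producers={'i_am_iterator': lambda: iter(range(3))})
assert list(scope['i_am_iterator']) == [0, 1, 2]
assert list(scope['i_am_iterator']) == [0, 1, 2]  # second read is fresh too
```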
How does the DAG detect multiple consumption? Could be that a var node is connected to more than one func node and the value is known to be an iterator, either:

- declared explicitly (e.g. through an `Iterize` class, or by specifying this information in a var_node info dict), or
- inferred from the type (`Generator` or `Iterator`, e.g. a `typing.Generator` annotation).

What I don't like about this solution is that it is not local, like the first two. In meshed, we have our contextual objects (DAG, etc.), but still, I favor local solutions since they have better SoC (separation of concerns). The solution's complexity is carried by the object that carries the problem in the first place -- we don't need to extend (and therefore complexify) the DAG to handle those cases if and when they happen.