
openfeature-provider-flagd unexpected behaviour #198

Open
julianocosta89 opened this issue Feb 21, 2025 · 8 comments
@julianocosta89

Hello all 👋🏽

I wish I had more details, but unfortunately I don't have any logs about it.
We recently got an issue in the OTel Demo with the user reporting that the Load-Generator wasn't producing load.

I went through all updates and dependency bumps and narrowed it down to openfeature-provider-flagd as the culprit.

There is something going on with the update from 0.1.5 to 0.2.0, BUT just for the load-generator.

We have 2 Python services in the Demo:

Recommendation uses OTel auto-instrumentation and Load-Generator uses OTel instrumentation libraries.
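
For context, the load-generator's flag evaluation goes through the flagd provider, roughly along the lines below. This is only a sketch, not the Demo's actual locustfile; the flag key, host, and port are placeholders.

```python
# Rough sketch of the provider wiring a Locust script like the Demo's
# load-generator uses; not the actual locustfile. Flag key, host and port
# are placeholders.
from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider

# Per the report above, with 0.2.0 this provider initialization is where
# the hang appears when running under Locust.
api.set_provider(FlagdProvider(host="flagd", port=8013))
client = api.get_client()

print(client.get_boolean_value("my-flag", False))
```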

Here are the steps to reproduce:

  • Pull the Demo as is
  • build the load-generator (docker compose build load-generator)
  • run the demo (make start)
  • access Jaeger: http://localhost:8080/jaeger/ui
  • refresh the page a couple of times, while the load-generator is producing traffic and spans are arriving at Jaeger. (You should see 19 services)

Now, stop the Demo: make stop.

  • Update openfeature-provider-flagd to 0.2.0.
  • build the new load-generator (docker compose build load-generator)
  • run the demo (make start)
  • access Jaeger: http://localhost:8080/jaeger/ui
  • refresh the page a couple of times, while the load-generator SHOULD be producing traffic.
  • You will see at most 5 services, as the load-generator is not sending any load.

NOTE that recommendation is on version 0.2.0 and working fine, so this seems to be some race condition or call-order issue in the load-generator service.

@beeme1mr
Member

Hey @julianocosta89, we'll take a look. FYI @aepfli

@aepfli
Member

aepfli commented Feb 22, 2025

I tried to debug this. Thanks to this great explanation, I can reproduce it easily based on the steps. I will check on Monday whether I can find some more ideas.

things I tried:

  • running the Locust file locally with a debugger (it hangs while waiting for the connection-ready event of our gRPC channel; see the sketch after this list)
  • running the load-generator script without Locust (only executing the Python script without any other Locust dependency, even multiple scripts in parallel) - no issues
  • to rule out a dependency issue, I ran it with the same gRPC versions etc. as the Docker image produces - with Locust it still hangs, without Locust there are no problems
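
To make the "connection-ready event" concrete, this is the kind of wait involved for a gRPC channel. A generic sketch, not the provider's code; the address is a placeholder.

```python
# Generic illustration of waiting for a gRPC channel to become READY;
# not the provider's actual code, and the address is a placeholder.
import grpc

channel = grpc.insecure_channel("flagd:8013")
try:
    # Blocks until the channel reaches the READY state or the timeout expires.
    grpc.channel_ready_future(channel).result(timeout=5)
    print("channel ready")
except grpc.FutureTimeoutError:
    print("channel did not become ready in time")
```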

for debugging I used:

I will continue digging ;)

First discovery: we are missing a dependency within our dependency definitions 😱

@aepfli
Member

aepfli commented Feb 22, 2025

Okay, things are getting interesting. It seems like our thread is not starting, but I am not sure why self.thread.start() does not execute properly. Also, time.sleep is not working as expected in this environment.

It is related to this part of the code:

```python
while not self.connected and time.time() < timeout:
    time.sleep(0.05)
logger.debug("Finished blocking gRPC state initialization")
if not self.connected:
    raise ProviderNotReadyError(
        "Blocking init finished before data synced. Consider increasing startup deadline to avoid inconsistent evaluations."
    )
```

It hangs on the time.sleep - I am not sure if there is a nicer way to wait for this (see the sketch below), or if we can actually remove this part of the code. Anyway, this seems to be some sort of strange issue which correlates with execution under Locust. We should definitely inspect this and see if it is something we might need to handle differently.

//EDIT:
As soon as I remove the sleep, everything "works" - the provider still does not reach a ready state, because the stream connection is not working - but we can resolve flags ;)
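
On the "nicer way to wait" question: one option would be an event-based wait instead of the poll-and-sleep loop. A rough, illustrative sketch only (not the provider's actual internals; names like connected_event are made up), and whether it behaves better under Locust's gevent patching would still need verification.

```python
# Illustrative only, not the provider's code: signal readiness with a
# threading.Event set by the connection thread, instead of polling with
# time.sleep. Names (connected_event, connect_worker) are made up.
import threading

connected_event = threading.Event()

def connect_worker():
    # ... establish the gRPC stream here, then signal readiness ...
    connected_event.set()

threading.Thread(target=connect_worker, daemon=True).start()

# Wait up to the startup deadline without a manual sleep loop.
if not connected_event.wait(timeout=5.0):
    raise RuntimeError("blocking init finished before data synced")
```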

@beeme1mr
Member

Hey @aepfli, it looks like Locust uses gevent for concurrency. As a test, could you please try gevent.sleep() instead of time.sleep?
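
For reference, the suggested change amounts to something like this. A sketch that assumes the loop shown in the earlier snippet; provider.connected mirrors self.connected there, and the deadline value is a placeholder.

```python
# Sketch of the suggestion above: swap time.sleep for gevent's cooperative
# sleep so the waiting loop yields to other greenlets (e.g. the gRPC
# connection work) instead of blocking the whole process.
import time
import gevent

def wait_until_connected(provider, deadline_seconds=5.0):
    deadline = time.time() + deadline_seconds
    # provider.connected mirrors self.connected from the snippet earlier.
    while not provider.connected and time.time() < deadline:
        gevent.sleep(0.05)  # yields to the gevent hub while waiting
    return provider.connected
```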

@aepfli
Member

aepfli commented Feb 22, 2025

I can try, but this means we need a flexible way, or to rely on a library which handles this, in the future

@beeme1mr
Member

> I can try, but this means we need a flexible way, or to rely on a library which handles this, in the future

Yeah, it's just another interesting data point. It looks like Locust may add interesting runtime restrictions. Perhaps a simple HTTP call to flagd would be better in this load-gen script.
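
A "simple HTTP call" could look roughly like this, going through flagd's OFREP endpoint instead of the gRPC provider. A sketch only; the port (8016), flag key, and context shape are assumptions, not values taken from the demo.

```python
# Sketch of a plain HTTP evaluation against flagd's OFREP endpoint as an
# alternative inside the Locust script. Port, flag key and context are
# assumptions, not values from the demo.
import requests

OFREP_URL = "http://flagd:8016/ofrep/v1/evaluate/flags/{key}"

def evaluate_flag(key, default, context=None):
    try:
        resp = requests.post(
            OFREP_URL.format(key=key),
            json={"context": context or {}},
            timeout=2,
        )
        resp.raise_for_status()
        return resp.json().get("value", default)
    except requests.RequestException:
        return default  # fall back to the default on any transport error

print(evaluate_flag("my-flag", False))
```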

@aepfli
Member

aepfli commented Feb 24, 2025

Even with gevent.sleep it is problematic. Maybe the OFREP endpoint will be a better option for this tool to fetch feature flags. It seems like the monkey patching from gevent is not working either. But I learned about gevent, and it looks really interesting ;)

@aepfli
Member

aepfli commented Feb 24, 2025

Summary of findings:

  • Locust seems to monkey-patch certain things, and our new "rpc" and "in-process" providers use threading and sometimes sleeps, which is not really compatible with Locust
  • we do have the OFREP provider, which communicates with flagd via polling and works great without any issue
  • the OTel demo specifies an env var FLAGD_PORT and uses it for configuration, except on the flagd container 🐛 - I will fix that too (see the sketch after this list)
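
For the FLAGD_PORT point, the env-var-driven wiring being described is roughly this. A sketch only; FLAGD_HOST and the default values are assumptions, only FLAGD_PORT is named in the finding above.

```python
# Rough sketch of env-var-driven flagd configuration; FLAGD_HOST and the
# defaults are assumptions, FLAGD_PORT is the variable named above.
import os

flagd_host = os.environ.get("FLAGD_HOST", "flagd")
flagd_port = int(os.environ.get("FLAGD_PORT", "8013"))

print(f"evaluating flags against {flagd_host}:{flagd_port}")
```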

Further suggestions:

The OTel demo is great and uses OpenFeature and flagd heavily. We should make use of this and generate use cases for different configurations and communication options. Currently we are only using "rpc" mode; soon we will be using OFREP too. But we should also highlight other scenarios and describe them in the OTel docs:

  • in-process mode
  • rpc mode without caching (ups and downs)
  • OFREP for the frontend
  • thinking about some targeting rules (e.g. one flag, different behaviour per service)

We would never be able to create/run/maintain such a great demo ourselves, and we can also add value, as it clearly shows the observability highlights you can get with OpenFeature. - I will extract this into its own issue ;)
