
openfeature-provider-flagd unexpected behaviour #198

Open
julianocosta89 opened this issue Feb 21, 2025 · 8 comments
@julianocosta89

Hello all 👋🏽

I wish I had more details, but unfortunately I don't have any logs about it.
We recently got an issue in the OTel Demo with the user reporting that the Load-Generator wasn't producing load.

I went through all updates and dependency bumps and narrowed it down to openfeature-provider-flagd as the culprit.

There is something going on with the update from 0.1.5 to 0.2.0, BUT just for the load-generator.

We have 2 Python services in the Demo:

Recommendation uses OTel auto-instrumentation and Load-Generator uses OTel instrumentation libraries.
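
For context, the load-generator's flag evaluation goes through the flagd provider, roughly along the lines below. This is only a sketch, not the Demo's actual locustfile; the flag key, host, and port are placeholders.

```python
# Rough sketch of the provider wiring a Locust script like the Demo's
# load-generator uses; not the actual locustfile. Flag key, host and port
# are placeholders.
from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider

# Per the report above, with 0.2.0 this provider initialization is where
# the hang appears when running under Locust.
api.set_provider(FlagdProvider(host="flagd", port=8013))
client = api.get_client()

print(client.get_boolean_value("my-flag", False))
```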

Here are the steps to reproduce:

  • Pull the Demo as is
  • build the load-generator (docker compose build load-generator)
  • run the demo (make start)
  • access Jaeger: http://localhost:8080/jaeger/ui
  • refresh the page a couple of times, while the load-generator is producing traffic and spans are arriving at Jaeger. (You should see 19 services)

Now, stop the Demo: make stop.

  • Update openfeature-provider-flagd to 0.2.0.
  • build the new load-generator (docker compose build load-generator)
  • run the demo (make start)
  • access Jaeger: http://localhost:8080/jaeger/ui
  • refresh the page a couple of times, while the load-generator SHOULD be producing traffic.
  • You will see at most 5 services, as the load-generator is not sending any load.

NOTE that recommendation is on version 0.2.0 and working fine, so this seems to be some race condition or call-order issue in the load-generator service.

@beeme1mr
Member

Hey @julianocosta89, we'll take a look. FYI @aepfli

@aepfli
Member

aepfli commented Feb 22, 2025

I tried to debug this. Thanks to this great explanation, I can reproduce it easily based on the steps. I will check on Monday whether I can find some more ideas.

things I tried:

  • running the Locust file locally with a debugger (it hangs while waiting for the connection-ready event of our gRPC channel; see the sketch after this list)
  • running the load-generator script without Locust (only executing the Python script without any other Locust dependency, even multiple scripts in parallel) - no issues
  • to rule out a dependency issue, I ran it with the same gRPC versions etc. as the Docker image produces - with Locust it still hangs, without Locust there are no problems
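
To make the "connection-ready event" concrete, this is the kind of wait involved for a gRPC channel. A generic sketch, not the provider's code; the address is a placeholder.

```python
# Generic illustration of waiting for a gRPC channel to become READY;
# not the provider's actual code, and the address is a placeholder.
import grpc

channel = grpc.insecure_channel("flagd:8013")
try:
    # Blocks until the channel reaches the READY state or the timeout expires.
    grpc.channel_ready_future(channel).result(timeout=5)
    print("channel ready")
except grpc.FutureTimeoutError:
    print("channel did not become ready in time")
```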

for debugging I used:

I will continue digging ;)

First discovery: we are missing a dependency within our dependency definitions 😱

@aepfli
Member

aepfli commented Feb 22, 2025

Okay, things are getting interesting. It seems like our thread is not starting, but I am not sure why self.thread.start() does not execute properly. Also, time.sleep is not working as expected in this environment.

It is related to this part of the code:

```python
while not self.connected and time.time() < timeout:
    time.sleep(0.05)
logger.debug("Finished blocking gRPC state initialization")
if not self.connected:
    raise ProviderNotReadyError(
        "Blocking init finished before data synced. Consider increasing startup deadline to avoid inconsistent evaluations."
    )
```

It hangs on the time.sleep - I am not sure if there is a nicer way to wait for this (see the sketch below), or if we can actually remove this part of the code. Anyway, this seems to be some sort of strange issue which correlates with execution under Locust. We should definitely inspect this and see if it is something we might need to handle differently.

//EDIT:
As soon as I remove the sleep, everything "works" - the provider still does not reach a ready state, because the stream connection is not working - but we can resolve flags ;)
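
On the "nicer way to wait" question: one option would be an event-based wait instead of the poll-and-sleep loop. A rough, illustrative sketch only (not the provider's actual internals; names like connected_event are made up), and whether it behaves better under Locust's gevent patching would still need verification.

```python
# Illustrative only, not the provider's code: signal readiness with a
# threading.Event set by the connection thread, instead of polling with
# time.sleep. Names (connected_event, connect_worker) are made up.
import threading

connected_event = threading.Event()

def connect_worker():
    # ... establish the gRPC stream here, then signal readiness ...
    connected_event.set()

threading.Thread(target=connect_worker, daemon=True).start()

# Wait up to the startup deadline without a manual sleep loop.
if not connected_event.wait(timeout=5.0):
    raise RuntimeError("blocking init finished before data synced")
```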

@beeme1mr
Member

Hey @aepfli, it looks like Locust uses gevent for concurrency. As a test, could you please try gevent.sleep() instead of time.sleep?
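
For reference, the suggested change amounts to something like this. A sketch that assumes the loop shown in the earlier snippet; provider.connected mirrors self.connected there, and the deadline value is a placeholder.

```python
# Sketch of the suggestion above: swap time.sleep for gevent's cooperative
# sleep so the waiting loop yields to other greenlets (e.g. the gRPC
# connection work) instead of blocking the whole process.
import time
import gevent

def wait_until_connected(provider, deadline_seconds=5.0):
    deadline = time.time() + deadline_seconds
    # provider.connected mirrors self.connected from the snippet earlier.
    while not provider.connected and time.time() < deadline:
        gevent.sleep(0.05)  # yields to the gevent hub while waiting
    return provider.connected
```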

@aepfli
Member

aepfli commented Feb 22, 2025

I can try, but this means we need a flexible way, or to rely on a library which handles this, in the future

@beeme1mr
Member

> I can try, but this means we need a flexible way, or to rely on a library which handles this, in the future

Yeah, it's just another interesting data point. It looks like Locust may add interesting runtime restrictions. Perhaps a simple HTTP call to flagd would be better in this load-gen script.
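
A "simple HTTP call" could look roughly like this, going through flagd's OFREP endpoint instead of the gRPC provider. A sketch only; the port (8016), flag key, and context shape are assumptions, not values taken from the demo.

```python
# Sketch of a plain HTTP evaluation against flagd's OFREP endpoint as an
# alternative inside the Locust script. Port, flag key and context are
# assumptions, not values from the demo.
import requests

OFREP_URL = "http://flagd:8016/ofrep/v1/evaluate/flags/{key}"

def evaluate_flag(key, default, context=None):
    try:
        resp = requests.post(
            OFREP_URL.format(key=key),
            json={"context": context or {}},
            timeout=2,
        )
        resp.raise_for_status()
        return resp.json().get("value", default)
    except requests.RequestException:
        return default  # fall back to the default on any transport error

print(evaluate_flag("my-flag", False))
```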

@aepfli
Member

aepfli commented Feb 24, 2025

Even with gevent.sleep it is problematic. Maybe the OFREP endpoint will be a better option for this tool to fetch feature flags. It seems like the monkey patching from gevent is not working either. But I learned about gevent, and it looks really interesting ;)

@aepfli
Member

aepfli commented Feb 24, 2025

Summary of findings:

  • Locust seems to monkey-patch certain things, and our new "rpc" and "in-process" providers use threading and sometimes sleeps, which is not really compatible with Locust
  • we do have the OFREP provider, which communicates with flagd via polling and works great without any issue
  • the OTel demo specifies an env var FLAGD_PORT and uses it for configuration, except on the flagd container 🐛 - I will fix that too (see the sketch after this list)
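
For the FLAGD_PORT point, the env-var-driven wiring being described is roughly this. A sketch only; FLAGD_HOST and the default values are assumptions, only FLAGD_PORT is named in the finding above.

```python
# Rough sketch of env-var-driven flagd configuration; FLAGD_HOST and the
# defaults are assumptions, FLAGD_PORT is the variable named above.
import os

flagd_host = os.environ.get("FLAGD_HOST", "flagd")
flagd_port = int(os.environ.get("FLAGD_PORT", "8013"))

print(f"evaluating flags against {flagd_host}:{flagd_port}")
```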

Further suggestions:

The OTel demo is great and uses OpenFeature and flagd heavily. We should make use of this and generate use cases for different configurations and communication options. Currently we are only using "rpc" mode; soon we will be using OFREP too. But we should also highlight other scenarios and describe them in the OTel docs:

  • in-process mode
  • rpc mode without caching (ups and downs)
  • OFREP for the frontend
  • thinking about some targeting rules (e.g. one flag, different behaviour per service)

We would never be able to create/run/maintain such a great demo ourselves, and we can also add value, as it clearly shows the observability highlights you can get with OpenFeature. - I will extract this into its own issue ;)
