Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strategy for safe integration of new version of mc-providers (with direct ES support) #936

Open
philbudne opened this issue Jan 15, 2025 · 1 comment

Comments

@philbudne
Copy link
Contributor

N.B. This is a super-issue that may involve work (sub-issues) in both web-search (and two different branches of mc-providers), but I'm posting it here 'cause it's too big for a slack message!

Thoughts on ES-based mc-providers/web-search integration.

To mitigate risks that falling back to news-search-api will be
necessary, make it easy to fall back (and hope that this
preparation dissipates heisenbugs and other evil spirits, and makes it
unnecessary to fall back):

  1. Add a run-time (constance?) emergency pull-cord, checked in
    web-search/mcweb/backend/search/utils.py pq_provider function that
    (if the cord is pulled) directs any requests for the "mediacloud"
    provider to "mediacloud-old" (the NSA provider in the new version
    of the library).

  2. In the more severe case that even the "old" provider in the new
    mc-providers library doesn't work well enough, make a 2.2.1
    release of the current mc-providers library that accepts:

    a. new/extra args to provider_by_name (software_id, session_id)
    b. "mediacloud-old" name as an alias for "mediacloud"

    So that fall back to the (very) lightly modified version
    of the old library by editing of web-search/requirements.txt is
    possible.

Some known integration issues/areas with the ES based provider
OnlineNewsMediaCloudESProvider:

  1. Returns empty results when an error is detected.

    One such error condition is when a query fails for one or more
    shards (circuit breaker or other error); N-S-A returns partial
    results, the ESProvider detects the error, logs it, and returns
    empty results. I've seen that ES returns such errors when heap
    utilization is high.

    We either need to:
    a. Understand why this happens
    i. It's possible it's due to "top terms" aggregation queries,
    which often fail due to circuit breaker errors.
    A temporary run-time switch in web-search to turn
    of top terms might help us determine if this is the case.
    ii. It might be possible to try to infer possible causes
    looking at monitoring data (but that's a lot of conditionals
    in a row!)
    b. Hope that the new cluster hardware and removal of fielddata
    cures the problem, make it possible to have the provider return
    partial results (ignore errors), and have a temporary
    (environment variable based) switch to silently return data
    from queries which retruned incomplete data (failed for some shards).

  2. Exception handling

    a. There may be situations where errors detected by ESProvider that
    should be reported by raising an Exception rather than silently
    returning empty results (see above).

    b. There are almost certainly exceptions raised by
    the elasticsearch library, including (but not limited to):

    elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'parent task was cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [30615713720/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30615712304/28.5gb], new bytes reserved: [1416/1.3kb], usages [eql_sequence=0/0b, fielddata=5418974373/5gb, request=2131429/2mb, inflight_requests=6744/6.5kb, model_inference=0/0b]]]')

    and

    elasticsearch.ApiError: ApiError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [30641554912/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30641548168/28.5gb], new bytes reserved: [6744/6.5kb], usages [eql_sequence=0/0b, fielddata=10441141167/9.7gb, request=0/0b, inflight_requests=6744/6.5kb, model_inference=0/0b]')

    Such errors will need to be handled in ways that provide useful/helpful/comprehendable errors to web-search users
    (while also logging detailed information for developers to look at to try and diganose the problem).

    It's up to debate whether this should be done by catching
    errors in mc-providers re-raising exceptions with user-consumable
    messages, or whether the catching and translation goes into
    web-search middleware.

    The former better isolates concerns, but it complicates
    remediation of problems (requires two components to be
    changed/released).

    At the present, web-search is the only software that calls
    mc-providers search methods, so it's not completely
    unreasonable to (at least initially) perform the error
    translations in web-search (SO LONG AS THEY'RE WELL
    LOCALIZED, AND NOT SPREAD FAR AND WIDE!!!
    )

@philbudne
Copy link
Contributor Author

Correction, provider exceptions are handled by the @handle_provider_errors decorator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant