Strategy for safe integration of new version of mc-providers (with direct ES support) #936

philbudne · 2025-01-15T18:32:28Z

N.B. This is a super-issue that may involve work (sub-issues) in both web-search (and two different branches of mc-providers), but I'm posting it here 'cause it's too big for a slack message!

Thoughts on ES-based mc-providers/web-search integration.

To mitigate risks that falling back to news-search-api will be
necessary, make it easy to fall back (and hope that this
preparation dissipates heisenbugs and other evil spirits, and makes it
unnecessary to fall back):

Add a run-time (constance?) emergency pull-cord, checked in
web-search/mcweb/backend/search/utils.py pq_provider function that
(if the cord is pulled) directs any requests for the "mediacloud"
provider to "mediacloud-old" (the NSA provider in the new version
of the library).
In the more severe case that even the "old" provider in the new
mc-providers library doesn't work well enough, make a 2.2.1
release of the current mc-providers library that accepts:

a. new/extra args to provider_by_name (software_id, session_id)
b. "mediacloud-old" name as an alias for "mediacloud"

So that fall back to the (very) lightly modified version
of the old library by editing of web-search/requirements.txt is
possible.

Some known integration issues/areas with the ES based provider
OnlineNewsMediaCloudESProvider:

Returns empty results when an error is detected.

One such error condition is when a query fails for one or more
shards (circuit breaker or other error); N-S-A returns partial
results, the ESProvider detects the error, logs it, and returns
empty results. I've seen that ES returns such errors when heap
utilization is high.

We either need to:
a. Understand why this happens
i. It's possible it's due to "top terms" aggregation queries,
which often fail due to circuit breaker errors.
A temporary run-time switch in web-search to turn
of top terms might help us determine if this is the case.
ii. It might be possible to try to infer possible causes
looking at monitoring data (but that's a lot of conditionals
in a row!)
b. Hope that the new cluster hardware and removal of fielddata
cures the problem, make it possible to have the provider return
partial results (ignore errors), and have a temporary
(environment variable based) switch to silently return data
from queries which retruned incomplete data (failed for some shards).
Exception handling

a. There may be situations where errors detected by ESProvider that
should be reported by raising an Exception rather than silently
returning empty results (see above).

b. There are almost certainly exceptions raised by
the elasticsearch library, including (but not limited to):

elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'parent task was cancelled [Fatal failure during search: failed to merge result [[parent] Data too large, data for [<reduce_aggs>] would be [30615713720/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30615712304/28.5gb], new bytes reserved: [1416/1.3kb], usages [eql_sequence=0/0b, fielddata=5418974373/5gb, request=2131429/2mb, inflight_requests=6744/6.5kb, model_inference=0/0b]]]')

and

elasticsearch.ApiError: ApiError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [30641554912/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30641548168/28.5gb], new bytes reserved: [6744/6.5kb], usages [eql_sequence=0/0b, fielddata=10441141167/9.7gb, request=0/0b, inflight_requests=6744/6.5kb, model_inference=0/0b]')

Such errors will need to be handled in ways that provide useful/helpful/comprehendable errors to web-search users
(while also logging detailed information for developers to look at to try and diganose the problem).

It's up to debate whether this should be done by catching
errors in mc-providers re-raising exceptions with user-consumable
messages, or whether the catching and translation goes into
web-search middleware.

The former better isolates concerns, but it complicates
remediation of problems (requires two components to be
changed/released).

At the present, web-search is the only software that calls
mc-providers search methods, so it's not completely
unreasonable to (at least initially) perform the error
translations in web-search (SO LONG AS THEY'RE WELL
LOCALIZED, AND NOT SPREAD FAR AND WIDE!!!)

The text was updated successfully, but these errors were encountered:

philbudne · 2025-01-16T02:21:52Z

Correction, provider exceptions are handled by the @handle_provider_errors decorator

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strategy for safe integration of new version of mc-providers (with direct ES support) #936

Strategy for safe integration of new version of mc-providers (with direct ES support) #936

philbudne commented Jan 15, 2025

philbudne commented Jan 16, 2025

Strategy for safe integration of new version of mc-providers (with direct ES support) #936

Strategy for safe integration of new version of mc-providers (with direct ES support) #936

Comments

philbudne commented Jan 15, 2025

philbudne commented Jan 16, 2025