
Conversation

SnowMasaya
Contributor

This PR implements an enhanced translation caching system for Garak's language providers. The changes include:

Feature Enhancements

  • Enhanced cache storage to include original text alongside translation for better debugging and data integrity
  • Added get_cache_entry() method to retrieve full cache entries with metadata
  • Implemented automatic caching across all translation providers (local and remote)
  • Added backward compatibility with old cache format

Integration

  • Updated all translation classes (Passthru, LocalHFTranslator, RivaTranslator, DeeplTranslator, GoogleTranslator) to use the caching system
  • Added _translate_with_cache() and _translate_impl() methods for consistent caching behavior (sketched after this list)
  • Maintained existing error handling and retry logic for remote services
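
A minimal sketch of that wrapper pattern: the method names come from this PR, but the body and the assumed cache object with get/set are illustrative, not the actual Garak implementation.

import typing

class LangProvider:
    """Base provider sketch; real providers also hold language and model config."""

    def __init__(self, cache):
        self.cache = cache  # assumed to expose get(text) and set(text, translation)

    def _translate_with_cache(self, text: str) -> str:
        cached = self.cache.get(text)
        if cached is not None:
            return cached  # cache hit: the translation backend is never called
        translation = self._translate_impl(text)
        self.cache.set(text, translation)  # store for this run and later ones
        return translation

    def _translate_impl(self, text: str) -> str:
        # Overridden per provider (local HF model, Riva, DeepL, Google, ...)
        raise NotImplementedError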

Benefits

  • Performance: Significantly reduces translation time for repeated text
  • Cost savings: Reduces API calls to paid services like DeepL, Google Cloud Translation, and NVIDIA Riva
  • Reliability: Provides fallback for offline scenarios when cached translations are available
  • Consistency: Ensures identical translations for the same input text across different runs

Verification

Steps needed to verify that the feature works:

  • Run the tests and ensure they pass: python -m pytest tests/langservice/test_translation_cache.py
  • Run integration tests: python -m pytest tests/langservice/test_translation_cache_integration.py
  • Verify cache files are created in ~/.cache/garak/translation/ with correct naming pattern
  • Document the caching system in docs/source/translation.rst

Cache File Verification

  • Check cache file naming: translation_cache_{source_lang}_{target_lang}_{model_type}.json
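
For example, an en→ja run with a local model type would yield a name like translation_cache_en_ja_local.json (hypothetical language values; the docs hunk reviewed below also appends the model name to the pattern).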

Error Handling Verification

  • Test with corrupted cache files (see the load-fallback sketch after this list)
  • Test with missing API keys
  • Test with invalid language pairs
  • Verify graceful fallback behavior
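
For the corrupted-cache case, a minimal sketch of the defensive load such a test would exercise; the fall-back-to-empty behavior and the helper name are assumptions about the intended design, not code from this PR:

import json
import logging

def _load_cache(cache_file) -> dict:
    """Load the JSON cache, falling back to an empty dict if the file is damaged or missing."""
    try:
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        logging.warning("Translation cache unreadable, starting with an empty cache")
        return {}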

The caching system is transparent to users and requires no additional configuration. It automatically activates when translation services are used and provides significant benefits for performance and cost reduction.

- Enhance cache storage to include original text alongside translation for
  better debugging and data integrity
- Add get_cache_entry() method to retrieve full cache entries with metadata
- Implement _translate_with_cache() method in LangProvider for automatic caching

This provides a robust caching system that improves performance, reduces
API costs, and enhances debugging capabilities for all translation services.
- Update LocalHFTranslator to use _translate_with_cache for automatic caching
- Add _translate_impl method to LocalHFTranslator for non-cached translation

- Update RivaTranslator, DeepLTranslator, GoogleTranslator to use _translate_with_cache for automatic caching

This enables caching for translation services (Riva, DeepL, Google, local)
while maintaining existing error handling and retry logic, significantly
reducing API costs and improving performance for repeated translations.
- Create test subclasses that set required attributes before calling parent constructor
- Add comprehensive mocking for API key validation and provider loading
- Add integration tests for translation caching system
- Test cache persistence between translator instances
- Add detailed caching documentation explaining benefits and usage
- Document cache file naming convention and storage location
@jmartin-tech changed the title from Feature/translation cache to Feature: translation cache on Jul 9, 2025
Collaborator

@jmartin-tech left a comment

Awesome addition, this will be very useful and gets the project one step closer to being able to support user-provided human translation as well.

A few changes are requested, and I added a few notes for possible future proofing.

Comment on lines 207 to 213
@property
def cache(self):
    return self._cache

@property
def cache_file_path(self):
    return self.cache_file
Collaborator

These should not need to be exposed; the internally held values should not be accessed publicly, and tests can evaluate the private objects.

masayaOgushi and others added 7 commits July 15, 2025 20:25
…ache

- Changed the type hint for the provider argument in TranslationCache to "LangProvider" (as a string) to safely reference a class defined later in the same file
- change save file name
- make hash code change
- remove extra pass thru test
- check save file name
@jmartin-tech dismissed their stale review July 28, 2025 21:04

Comments addressed as of 5ba91f6, additional core team review is pending.

Collaborator

@jmartin-tech left a comment

One more minor revision.

This looks pretty usable as an iteration towards more efficient additional language support.

"""Translate text with caching support."""
cached_translation = self.cache.get(text)
if cached_translation is not None:
logging.debug(f"Using cached translation for text: {text[:50]}...")
Collaborator

This should not log the original text; these values can source from prompts or LLM responses and could produce escape sequences that could impact something that is monitoring the logs.

Suggested change
logging.debug(f"Using cached translation for text: {text[:50]}...")

I could see, however, keeping a tracker in the service for cache hit/miss rates. While not something we need to expose at this time, having the metric values in the runtime could be helpful during debugging scenarios.

Collaborator

@erickgalinkin left a comment

Looks good. I think there are a few areas for performance improvements but no objections to merging as-is.

Comment on lines +180 to +181
def get_cache_key(self, text: str) -> str:
    return hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()
Collaborator

This seems like potentially a lot of overhead. Approach is, IMO, fine for now but perhaps we should consider partial hashing or something? Given potentially many thousands of strings that are often very long, we could consider a faster cache lookup approach in the future.

Contributor Author

Most cases are short sentences. If we support long sentences, we can consider "partial hashing" in the future.

Collaborator

I thought about this and measured hashlib.md5 speed. It seems pretty fine.

Alternative is to use the raw string as the key and rely on the underlying index's hashing (e.g. dict keys). This halves the amount of hashing. Might bump into length limits though ¯\_(ツ)_/¯
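
For reference, one way to reproduce that kind of measurement (illustrative only; the text and iteration count are arbitrary and results vary by machine):

import hashlib
import timeit

text = "Translate this sentence into the target language." * 10
elapsed = timeit.timeit(
    lambda: hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest(),
    number=100_000,
)
print(f"{elapsed / 100_000 * 1e6:.2f} µs per hash")  # typically a few µs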

Comment on lines +183 to +185
def get(self, text: str) -> str | None:
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
Collaborator

Since we're going to have a ton of cache misses on first invocation, does it make sense for us to have an easier way to avoid cache misses? Maybe just a check if the cache is empty?

Collaborator

The cache is built in real time, and many of the probes result in hits based on responses very quickly in the first run. Checks that result in a miss even on the first run still seem worth the cost, IMO.

    return hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()

def get(self, text: str) -> str | None:
    cache_key = self.get_cache_key(text)
Collaborator

Suggested change
cache_key = self.get_cache_key(text)
if not len(self._cache):
    return None
cache_key = self.get_cache_key(text)

Would save us a ton of hashing. At some point in the future, we could even add hierarchy and do it by probe or something so that we simply avoid generating a ton of hashes to look up in an empty cache for no reason.

Collaborator

The current implementation is not caching full prompts or responses but each sentence, I do think in a future revision we will see some optimization to add here.

Collaborator

+1 for skipping on empty cache

@jmartin-tech requested a review from leondz September 5, 2025 14:42
Collaborator

@leondz left a comment

It is connected to the right places and the general form is good.

Would prefer a much simpler cache with minimal duplication, where the cache is a simple keyed object that uses the source text as the key and has the translated text as the value. A single dict can do this (a sketch follows).
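
A minimal sketch of that shape, keeping JSON persistence; the class name, method names, and pathlib-based I/O here are illustrative, not the PR's code:

import json

class SimpleTranslationCache:
    """A single dict keyed on source text, with translated text as the value."""

    def __init__(self, cache_file):
        self.cache_file = cache_file  # assumed to be a pathlib.Path
        try:
            self._cache = json.loads(cache_file.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            self._cache = {}

    def get(self, text: str) -> str | None:
        return self._cache.get(text)

    def set(self, text: str, translation: str) -> None:
        self._cache[text] = translation  # the source text is the key, no hashing
        self.cache_file.write_text(
            json.dumps(self._cache, ensure_ascii=False), encoding="utf-8"
        )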

**How it works:**

- Each translation pair (source language → target language) gets its own cache file
- Cache files are stored in JSON format under the cache directory: ``{cache_dir}/translation/translation_cache_{source_lang}_{target_lang}_{model_type}_{model_name}.json``
Collaborator

do model_type and model_name refer to the translator model? what do API translators look like?

class TranslationCache:
    def __init__(self, provider: "LangProvider"):
        if not hasattr(provider, "model_type"):
            return None  # providers without a model_type do not have a cache
Collaborator

why? is this expressing a non-explicit convention in LangProvider naming? if so would prefer this to be expressed explicitly

Collaborator

This is checking an internal detail of the implemented object. We could add some other defining factor; however, at this time only the PassThru class meets such criteria.

Collaborator

We should add some other defining factor. I can open an issue or we can do it here. What would you prefer? A class attrib?


cache_dir = _config.transient.cache_dir / "translation"
cache_dir.mkdir(mode=0o740, parents=True, exist_ok=True)
cache_filename = f"translation_cache_{self.source_lang}_{self.target_lang}_{self.model_type}_{self.model_name.replace('/', '_')}.json"
Collaborator

Prefer dbm backend for this vs. a large inflight dict with occasional serdes (i guess that word kinda fits here?) action. Can be a fast follow
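
For reference, a minimal sketch of what a dbm-backed store could look like, using the stdlib dbm module (illustrative, not part of this PR; the path is hypothetical). dbm persists each entry individually, avoiding re-serialising the whole dict on every write:

import dbm

# dbm stores keys and values as bytes, so strings are encoded explicitly.
with dbm.open("/tmp/translation_cache_en_ja", "c") as db:
    db["source sentence".encode("utf-8")] = "translated sentence".encode("utf-8")
    cached = db.get("source sentence".encode("utf-8"))
    if cached is not None:
        print(cached.decode("utf-8"))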

cache_dir.mkdir(mode=0o740, parents=True, exist_ok=True)
cache_filename = f"translation_cache_{self.source_lang}_{self.target_lang}_{self.model_type}_{self.model_name.replace('/', '_')}.json"
self.cache_file = cache_dir / cache_filename
logging.info(f"Cache file: {self.cache_file}")
Collaborator

needs verb + qualification

Suggested change
logging.info(f"Cache file: {self.cache_file}")
logging.info(f"Loading translation cache file: {self.cache_file}")

Comment on lines +198 to +201
"source_lang": self.source_lang,
"target_lang": self.target_lang,
"model_type": self.model_type,
"model_name": self.model_name,
Collaborator

this info is redundant - it's logged in the cache object's filename - suggest it is deleted

Comment on lines +194 to +195
cache_key = self.get_cache_key(text)
self._cache[cache_key] = {
Collaborator

Let's just use the source text itself as the key

Collaborator

Would this be guaranteed to be consistently viable? Using a hash provides a known consistent format that is also cross-platform compatible (no OS-specific escaping of the alphanumeric hash strings would apply). While in theory all Python versions should use consistent internal representations of strings, that assumption is not guaranteed, and lines in a prompt could contain something odd; in fact this is a core expectation, as we test the edges of things.

Collaborator

@leondz Sep 8, 2025

It's as consistently viable as the underlying object. Python dict requires that keys are hashable. Python str is a hashable type. Max str length is predicated on arch; for 32-bit it's ~2.1GB.

So yes, consistently viable. Not universally viable but we will have Other Problems (and many of them) before it breaks.

Python already hashes the keys we're using (strings) and does all the hard work putting them into hashed lookups (dicts). No need to duplicate that work.

"model_type": self.model_type,
"model_name": self.model_name,
}
self._save_cache()
Collaborator

this means a Lot of serialisation and saving!

Comment on lines +205 to +221
def get_cache_entry(self, text: str) -> dict | None:
    """Get full cache entry including original text and metadata."""
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
    if cache_entry and isinstance(cache_entry, dict):
        return cache_entry
    elif isinstance(cache_entry, str):
        # Backward compatibility with old format
        return {
            "original": text,
            "translation": cache_entry,
            "source_lang": self.source_lang,
            "target_lang": self.target_lang,
            "model_type": self.model_type,
            "model_name": self.model_name,
        }
    return None
Collaborator

Let's not have backward compatibility in the first PR landing a feature

Let's simplify the cache entries by removing information we have elsewhere

Let's index on the text directly, halving the amount of hashing done

Suggested change
def get_cache_entry(self, text: str) -> dict | None:
    """Get full cache entry including original text and metadata."""
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
    if cache_entry and isinstance(cache_entry, dict):
        return cache_entry
    elif isinstance(cache_entry, str):
        # Backward compatibility with old format
        return {
            "original": text,
            "translation": cache_entry,
            "source_lang": self.source_lang,
            "target_lang": self.target_lang,
            "model_type": self.model_type,
            "model_name": self.model_name,
        }
    return None
def get_cache_entry(self, text: str) -> str | None:
    return self._cache.get(text)

logging.debug(f"Using cached translation for text: {text[:50]}...")
return cached_translation
translation = self._translate_impl(text)
self.cache.set(text, translation)
Collaborator

Can we move to a direct src->dst cache? i.e.

Suggested change
    self.cache.set(text, translation)
    self.cache[text] = translation
