
Conversation

SnowMasaya
Contributor

This PR implements an enhanced translation caching system for Garak's language providers. The changes include:

Feature Enhancements

  • Enhanced cache storage to include original text alongside translation for better debugging and data integrity
  • Added get_cache_entry() method to retrieve full cache entries with metadata
  • Implemented automatic caching across all translation providers (local and remote)
  • Added backward compatibility with old cache format

Integration

  • Updated all translation classes (Passthru, LocalHFTranslator, RivaTranslator, DeeplTranslator, GoogleTranslator) to use the caching system
  • Added _translate_with_cache() and _translate_impl() methods for consistent caching behavior (sketched after this list)
  • Maintained existing error handling and retry logic for remote services
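
A minimal sketch of that wrapper pattern: the method names come from this PR, but the body and the assumed cache object with get/set are illustrative, not the actual Garak implementation.

import typing

class LangProvider:
    """Base provider sketch; real providers also hold language and model config."""

    def __init__(self, cache):
        self.cache = cache  # assumed to expose get(text) and set(text, translation)

    def _translate_with_cache(self, text: str) -> str:
        cached = self.cache.get(text)
        if cached is not None:
            return cached  # cache hit: the translation backend is never called
        translation = self._translate_impl(text)
        self.cache.set(text, translation)  # store for this run and later ones
        return translation

    def _translate_impl(self, text: str) -> str:
        # Overridden per provider (local HF model, Riva, DeepL, Google, ...)
        raise NotImplementedError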

Benefits

  • Performance: Significantly reduces translation time for repeated text
  • Cost savings: Reduces API calls to paid services like DeepL, Google Cloud Translation, and NVIDIA Riva
  • Reliability: Provides fallback for offline scenarios when cached translations are available
  • Consistency: Ensures identical translations for the same input text across different runs

Verification

Steps needed to verify that the feature works:

  • Run the tests and ensure they pass: python -m pytest tests/langservice/test_translation_cache.py
  • Run integration tests: python -m pytest tests/langservice/test_translation_cache_integration.py
  • Verify cache files are created in ~/.cache/garak/translation/ with correct naming pattern
  • Document the caching system in docs/source/translation.rst

Cache File Verification

  • Check cache file naming: translation_cache_{source_lang}_{target_lang}_{model_type}.json
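
For example, an en→ja run with a local model type would yield a name like translation_cache_en_ja_local.json (hypothetical language values; the docs hunk reviewed below also appends the model name to the pattern).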

Error Handling Verification

  • Test with corrupted cache files (see the load-fallback sketch after this list)
  • Test with missing API keys
  • Test with invalid language pairs
  • Verify graceful fallback behavior
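
For the corrupted-cache case, a minimal sketch of the defensive load such a test would exercise; the fall-back-to-empty behavior and the helper name are assumptions about the intended design, not code from this PR:

import json
import logging

def _load_cache(cache_file) -> dict:
    """Load the JSON cache, falling back to an empty dict if the file is damaged or missing."""
    try:
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        logging.warning("Translation cache unreadable, starting with an empty cache")
        return {}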

The caching system is transparent to users and requires no additional configuration. It automatically activates when translation services are used and provides significant benefits for performance and cost reduction.

- Enhance cache storage to include original text alongside translation for
  better debugging and data integrity
- Add get_cache_entry() method to retrieve full cache entries with metadata
- Implement _translate_with_cache() method in LangProvider for automatic caching

This provides a robust caching system that improves performance, reduces
API costs, and enhances debugging capabilities for all translation services.
- Update LocalHFTranslator to use _translate_with_cache for automatic caching
- Add _translate_impl method to LocalHFTranslator for non-cached translation

- Update RivaTranslator, DeepLTranslator, GoogleTranslator to use _translate_with_cache for automatic caching

This enables caching for translation services (Riva, DeepL, Google, local)
while maintaining existing error handling and retry logic, significantly
reducing API costs and improving performance for repeated translations.
- Create test subclasses that set required attributes before calling parent constructor
- Add comprehensive mocking for API key validation and provider loading
- Add integration tests for translation caching system
- Test cache persistence between translator instances
- Add detailed caching documentation explaining benefits and usage
- Document cache file naming convention and storage location
@jmartin-tech changed the title from Feature/translation cache to Feature: translation cache on Jul 9, 2025
Collaborator

@jmartin-tech left a comment

Awesome addition, this will be very useful and gets the project one step closer to being able to support user-provided human translation as well.

A few changes are requested, and I added a few notes for possible future proofing.

Comment on lines 207 to 213
@property
def cache(self):
    return self._cache

@property
def cache_file_path(self):
    return self.cache_file
Collaborator

These should not need to be exposed; the internally held values should not be accessed publicly, and tests can evaluate the private objects.

masayaOgushi and others added 7 commits July 15, 2025 20:25
…ache

- Changed the type hint for the provider argument in TranslationCache to "LangProvider" (as a string) to safely reference a class defined later in the same file
- change save file name
- make hash code change
- remove extra pass thru test
- check save file name
@jmartin-tech dismissed their stale review July 28, 2025 21:04

Comments addressed as of 5ba91f6, additional core team review is pending.

Collaborator

@jmartin-tech left a comment

One more minor revision.

This looks pretty usable as an iteration towards more efficient additional language support.

"""Translate text with caching support."""
cached_translation = self.cache.get(text)
if cached_translation is not None:
logging.debug(f"Using cached translation for text: {text[:50]}...")
Collaborator

This should not log the original text; these values can source from prompts or LLM responses and could produce escape sequences that could impact something that is monitoring the logs.

Suggested change
logging.debug(f"Using cached translation for text: {text[:50]}...")

I could see, however, keeping a tracker in the service for cache hit/miss rates. While not something we need to expose at this time, having the metric values in the runtime could be helpful during debugging scenarios.

Collaborator

@erickgalinkin left a comment

Looks good. I think there are a few areas for performance improvements but no objections to merging as-is.

Comment on lines +180 to +181
def get_cache_key(self, text: str) -> str:
    return hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()
Collaborator

This seems like potentially a lot of overhead. Approach is, IMO, fine for now but perhaps we should consider partial hashing or something? Given potentially many thousands of strings that are often very long, we could consider a faster cache lookup approach in the future.

Contributor Author

Most cases are short sentences. If we support long sentences, we can consider "partial hashing" in the future.

Collaborator

I thought about this and measured hashlib.md5 speed. It seems pretty fine.

Alternative is to use the raw string as the key and rely on the underlying index's hashing (e.g. dict keys). This halves the amount of hashing. Might bump into length limits though ¯\_(ツ)_/¯
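
For reference, one way to reproduce that kind of measurement (illustrative only; the text and iteration count are arbitrary and results vary by machine):

import hashlib
import timeit

text = "Translate this sentence into the target language." * 10
elapsed = timeit.timeit(
    lambda: hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest(),
    number=100_000,
)
print(f"{elapsed / 100_000 * 1e6:.2f} µs per hash")  # typically a few µs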

Comment on lines +183 to +185
def get(self, text: str) -> str | None:
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
Collaborator

Since we're going to have a ton of cache misses on first invocation, does it make sense for us to have an easier way to avoid cache misses? Maybe just a check if the cache is empty?

Collaborator

The cache is built in real time, and many of the probes result in hits based on responses very quickly in the first run. Checks that result in a miss even on the first run still seem worth the cost, IMO.

    return hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()

def get(self, text: str) -> str | None:
    cache_key = self.get_cache_key(text)
Collaborator

Suggested change
cache_key = self.get_cache_key(text)
if not len(self._cache):
    return None
cache_key = self.get_cache_key(text)

Would save us a ton of hashing. At some point in the future, we could even add hierarchy and do it by probe or something so that we simply avoid generating a ton of hashes to look up in an empty cache for no reason.

Collaborator

The current implementation is not caching full prompts or responses but each sentence, I do think in a future revision we will see some optimization to add here.

Collaborator

+1 for skipping on empty cache

@jmartin-tech requested a review from leondz September 5, 2025 14:42
Collaborator

@leondz left a comment

It is connected to the right places and the general form is good.

Would prefer a much simpler cache with minimal duplication, where the cache is a simple keyed object that uses the source text as the key and has the translated text as the value. A single dict can do this (a sketch follows).
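
A minimal sketch of that shape, keeping JSON persistence; the class name, method names, and pathlib-based I/O here are illustrative, not the PR's code:

import json

class SimpleTranslationCache:
    """A single dict keyed on source text, with translated text as the value."""

    def __init__(self, cache_file):
        self.cache_file = cache_file  # assumed to be a pathlib.Path
        try:
            self._cache = json.loads(cache_file.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            self._cache = {}

    def get(self, text: str) -> str | None:
        return self._cache.get(text)

    def set(self, text: str, translation: str) -> None:
        self._cache[text] = translation  # the source text is the key, no hashing
        self.cache_file.write_text(
            json.dumps(self._cache, ensure_ascii=False), encoding="utf-8"
        )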

**How it works:**

- Each translation pair (source language → target language) gets its own cache file
- Cache files are stored in JSON format under the cache directory: ``{cache_dir}/translation/translation_cache_{source_lang}_{target_lang}_{model_type}_{model_name}.json``
Collaborator

do model_type and model_name refer to the translator model? what do API translators look like?

class TranslationCache:
    def __init__(self, provider: "LangProvider"):
        if not hasattr(provider, "model_type"):
            return None  # providers without a model_type do not have a cache
Collaborator

why? is this expressing a non-explicit convention in LangProvider naming? if so would prefer this to be expressed explicitly

Collaborator

This is checking an internal detail of the implemented object. We could add some other defining factor; however, at this time only the PassThru class meets such criteria.

Collaborator

We should add some other defining factor. I can open an issue or we can do it here. What would you prefer? A class attrib?


cache_dir = _config.transient.cache_dir / "translation"
cache_dir.mkdir(mode=0o740, parents=True, exist_ok=True)
cache_filename = f"translation_cache_{self.source_lang}_{self.target_lang}_{self.model_type}_{self.model_name.replace('/', '_')}.json"
Collaborator

Prefer dbm backend for this vs. a large inflight dict with occasional serdes (i guess that word kinda fits here?) action. Can be a fast follow
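
For reference, a minimal sketch of what a dbm-backed store could look like, using the stdlib dbm module (illustrative, not part of this PR; the path is hypothetical). dbm persists each entry individually, avoiding re-serialising the whole dict on every write:

import dbm

# dbm stores keys and values as bytes, so strings are encoded explicitly.
with dbm.open("/tmp/translation_cache_en_ja", "c") as db:
    db["source sentence".encode("utf-8")] = "translated sentence".encode("utf-8")
    cached = db.get("source sentence".encode("utf-8"))
    if cached is not None:
        print(cached.decode("utf-8"))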

cache_dir.mkdir(mode=0o740, parents=True, exist_ok=True)
cache_filename = f"translation_cache_{self.source_lang}_{self.target_lang}_{self.model_type}_{self.model_name.replace('/', '_')}.json"
self.cache_file = cache_dir / cache_filename
logging.info(f"Cache file: {self.cache_file}")
Collaborator

needs verb + qualification

Suggested change
logging.info(f"Cache file: {self.cache_file}")
logging.info(f"Loading translation cache file: {self.cache_file}")

Comment on lines +198 to +201
"source_lang": self.source_lang,
"target_lang": self.target_lang,
"model_type": self.model_type,
"model_name": self.model_name,
Collaborator

this info is redundant - it's logged in the cache object's filename - suggest it is deleted

Comment on lines +194 to +195
cache_key = self.get_cache_key(text)
self._cache[cache_key] = {
Collaborator

Let's just use the source text itself as the key

Collaborator

Would this be guaranteed to be consistently viable? Using a hash provides a known consistent format that is also cross-platform compatible (no OS-specific escaping of the alphanumeric hash strings would apply). While in theory all Python versions should use consistent internal representations of strings, that assumption is not guaranteed, and lines in a prompt could contain something odd; in fact this is a core expectation, as we test the edges of things.

Collaborator

@leondz Sep 8, 2025

It's as consistently viable as the underlying object. Python dict requires that keys are hashable. Python str is a hashable type. Max str length is predicated on arch; for 32-bit it's ~2.1GB.

So yes, consistently viable. Not universally viable but we will have Other Problems (and many of them) before it breaks.

Python already hashes the keys we're using (strings) and does all the hard work putting them into hashed lookups (dicts). No need to duplicate that work.

"model_type": self.model_type,
"model_name": self.model_name,
}
self._save_cache()
Collaborator

this means a Lot of serialisation and saving!

Comment on lines +205 to +221
def get_cache_entry(self, text: str) -> dict | None:
    """Get full cache entry including original text and metadata."""
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
    if cache_entry and isinstance(cache_entry, dict):
        return cache_entry
    elif isinstance(cache_entry, str):
        # Backward compatibility with old format
        return {
            "original": text,
            "translation": cache_entry,
            "source_lang": self.source_lang,
            "target_lang": self.target_lang,
            "model_type": self.model_type,
            "model_name": self.model_name,
        }
    return None
Collaborator

Let's not have backward compatibility in the first PR landing a feature

Let's simplify the cache entries by removing information we have elsewhere

Let's index on the text directly, halving the amount of hashing done

Suggested change
def get_cache_entry(self, text: str) -> dict | None:
    """Get full cache entry including original text and metadata."""
    cache_key = self.get_cache_key(text)
    cache_entry = self._cache.get(cache_key)
    if cache_entry and isinstance(cache_entry, dict):
        return cache_entry
    elif isinstance(cache_entry, str):
        # Backward compatibility with old format
        return {
            "original": text,
            "translation": cache_entry,
            "source_lang": self.source_lang,
            "target_lang": self.target_lang,
            "model_type": self.model_type,
            "model_name": self.model_name,
        }
    return None
def get_cache_entry(self, text: str) -> str | None:
    return self._cache.get(text)

logging.debug(f"Using cached translation for text: {text[:50]}...")
return cached_translation
translation = self._translate_impl(text)
self.cache.set(text, translation)
Collaborator

Can we move to a direct src->dst cache? i.e.

Suggested change
    self.cache.set(text, translation)
    self.cache[text] = translation
