Fix caching calls to `_vector_for_key_cached` and `_out_of_vocab_vector_cached` #47

Open

zfang wants to merge 5 commits into plasticityai:master
Conversation
`_query_is_cached` will always return `False` because `_cache.get` expects `key` to be wrapped in a tuple. This renders the caching useless.
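As a toy illustration of the bug (a plain dict stands in for pymagnitude's LRU cache; the key shape mirrors the one the cache actually stores), probing with the bare key can never match an entry stored under a tuple-shaped key:

```python
# Toy illustration: a plain dict stands in for pymagnitude's LRU cache.
# Entries are stored under tuple-shaped keys, so probing with the bare
# key (as the broken _query_is_cached did) is always a miss.
cache = {}
key = 'hello'
cache[((key,), frozenset([('normalized', True)]))] = 'cached vector'

print(cache.get(key))                                          # None: always a miss
print(cache.get(((key,), frozenset([('normalized', True)]))))  # 'cached vector'
```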
Fix `_query_is_cached` to allow caching
Author
I'm surprised that this PR has received no attention, because it improves the performance of our service by a large margin. Here is a code snippet to help understand the effect of this change:

```python
from collections import defaultdict

import pandas as pd
from pymagnitude import *

words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache']
reversed_words = list(reversed(words))

vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)

vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']

def log_cached(vector):
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')

print('### Query ...')
vector.query(words)
log_cached(vector)

print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)
```

Output before the change:

Output after the change:

I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.
Fix `_query_is_cached` to actually enable caching
Fix `_vector_for_key_cached` and `_out_of_vocab_vector_cached` `_cache.get` call
Contributor
Hi @zfang,

Thanks for this PR; this likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so. I'll get around to reviewing this tonight and merging it in the next few days. I'll also add some tests to make sure the cache works and to prevent regressions in the future.
`_query_is_cached` will always return `False` because `key` should be in a tuple. `lru_cache` is able to unify `args`, `kwargs`, and default args in a call with the `get_default_args` magic in order to generate a consistent cache key. What this means is that:

a. all the default `args` will be part of `kwargs`;
b. any `args` with a default value will also be converted to `kwargs`;
c. for a parameter that has no default value, if you provide it as `args` in one call and as `kwargs` in another, they will have different cache keys.

Therefore `_out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)])))` will always return `False`, since the actual cache key is `((key,), frozenset([('normalized', normalized), ('force', force)]))`.

It's wasteful to call `_cache.get` and throw away the result, so I changed `_query_is_cached` to `_query_cached`.