Fix caching calls to _vector_for_key_cached and _out_of_vocab_vector_cached #47

Open
zfang wants to merge 5 commits into plasticityai:master from zfang:master

Conversation

@zfang zfang commented Feb 7, 2019

  1. `_query_is_cached` always returns False, because `key` needs to be wrapped in a tuple.

  2. `lru_cache` unifies args, kwargs, and default args in a call with the `get_default_args` magic in order to generate a consistent cache key. What this means is that
    a. all the default args become part of kwargs;
    b. any positional args with a default value are also converted to kwargs;
    c. for a parameter without a default value, passing it positionally in one call and as a kwarg in another produces different cache keys.
    Therefore `_out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)])))` always misses, since the actual cache key is `((key,), frozenset([('normalized', normalized), ('force', force)]))`.

  3. It's wasteful to call `_cache.get` and throw away the result, so I changed `_query_is_cached` to `_query_cached`.
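The two lookup pitfalls above can be sketched with a plain dict standing in for pymagnitude's LRU cache (the exact key layout here is an assumption, modeled on the `_cache.get` calls quoted in this description):

```python
# Stand-in for the internal cache: keys are (args_tuple, frozenset_of_kwargs),
# with default args ('force' here) already folded into the kwargs frozenset.
cache = {}
cache[(('hello',), frozenset([('normalized', True), ('force', False)]))] = [0.1, 0.2]

# Pitfall 1: probing with the bare key never matches, because stored keys
# wrap the key in a tuple.
assert cache.get('hello') is None

# Pitfall 2: probing with only part of the kwargs also misses, because the
# stored key includes the defaulted 'force' argument.
probe = (('hello',), frozenset([('normalized', True)]))
assert cache.get(probe) is None

# Only the full key, with defaults included, hits.
full = (('hello',), frozenset([('normalized', True), ('force', False)]))
assert cache.get(full) == [0.1, 0.2]
print('only the full key hits')
```

Since the probe in the old `_query_is_cached` never matched, every query took the slow path even when the vector was already cached.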

zfang added 2 commits February 6, 2019 21:16
`_query_is_cached` always returns false because `_cache.get` expects `key` to be wrapped in a tuple. This renders the caching useless.
Fix _query_is_cached to allow caching

zfang commented Feb 9, 2019

I'm surprised that this PR has received no attention, because it improves the performance of our service by a large margin. Here is a code snippet to help understand the effect of this change:

from collections import defaultdict

import pandas as pd
from pymagnitude import *

words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache', ]
reversed_words = list(reversed(words))

vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)
vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']


def log_cached(vector):
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')


print('### Query ...')
vector.query(words)
log_cached(vector)

print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)

Output before the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     0      12          0
_out_of_vocab_vector_cached  2400000       18     3      15          0 

Output after the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     6       6          0
_out_of_vocab_vector_cached  2400000       12     3       9          0 

I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.

@zfang zfang changed the title Fix _query_is_cached to actually enable caching Fix caching calls to _vector_for_key_cached and _out_of_vocab_vector_cached Feb 9, 2019
Contributor

AjayP13 commented Feb 12, 2019

Hi @zfang,

Thanks for this PR; this likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so. I'll get around to reviewing this tonight and merge it in the next few days.

I'll also add some tests to make sure the cache works and prevent regressions in the future.
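A regression test of that kind might look roughly like the following self-contained sketch, which uses the stdlib `functools.lru_cache` as a stand-in for pymagnitude's cache (the real test would exercise `Magnitude.query` and the `_cache` statistics shown in the snippet above; `lookup` is a hypothetical placeholder):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def lookup(key, normalized=True):
    # Placeholder for an expensive vector lookup.
    return key.upper()

words = ['hello', 'world', 'cache']
for w in words:
    lookup(w)            # first pass: every lookup is a miss
for w in reversed(words):
    lookup(w)            # second pass: every lookup must hit the cache

info = lookup.cache_info()
assert info.misses == 3 and info.hits == 3
print(info.hits, info.misses)  # 3 3
```

Note that all calls here pass `key` positionally; mixing positional and keyword calls would defeat the cache, which is exactly the failure mode this PR fixes.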

@AjayP13 AjayP13 self-assigned this Feb 12, 2019
@AjayP13 AjayP13 added bug Something isn't working caching labels Feb 12, 2019
