# dol design notes #53
Replies: 2 comments
---
## Mutable Mapping Interface for Filesystems and Blob Storage - Data Type Flexibility

### Introduction

The design of a store raises a data-type question: should writes accept only the store's declared value type, or should compatible types be converted automatically on the way in? Consider the following examples:

```python
# Flexible bytes store
s = FileBytes(...)

# writing bytes, reading bytes
s['some_key'] = b'some_bytes'
assert s['some_key'] == b'some_bytes'

# writing strings, reading back bytes
s['another_key'] = 'more_bytes'  # note we're writing a string here...
assert s['another_key'] == b'more_bytes'  # ... but we get bytes back

# Flexible text store
t = FileText(...)

# writing strings, reading strings
t['text_key'] = 'some_text'
assert t['text_key'] == 'some_text'

# writing bytes, reading strings
t['byte_key'] = b'some_bytes'  # note we're writing bytes here...
assert t['byte_key'] == 'some_bytes'  # ... but we get a string back (decoded with UTF-8)
```

### Pros and Cons of Flexible Design
### Best Practices and Design Philosophy

### Is This Approach Used in Other Libraries?
### Decisions

After careful consideration, it is proposed that we adhere to the explicit design for the initial implementation. This approach offers a more robust foundation, mitigating the potential pitfalls of the hybrid solution. "Premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth. Users who require automatic conversions can achieve that behavior by wrapping their stores with value codecs. Should this pattern become prevalent, we can streamline it further by providing utility classes or factory methods. This choice gives us a more solid base and avoids the downsides of the hybrid solution before we have clear knowledge of how useful the flexibility would actually be.
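For users who do want automatic conversion, the value codec wrapper mentioned above is small to write. Here is a dependency-free sketch of the idea (dol itself ships store-wrapping utilities for this kind of thing, but the class and function names below are illustrative, not dol API):

```python
from collections.abc import MutableMapping


class ValueCodecStore(MutableMapping):
    """Wrap a strict backend store: encode values on write, decode on read."""

    def __init__(self, store, encode, decode):
        self._store = store    # the wrapped (strict) store
        self._encode = encode  # applied on __setitem__
        self._decode = decode  # applied on __getitem__

    def __setitem__(self, key, value):
        self._store[key] = self._encode(value)

    def __getitem__(self, key):
        return self._decode(self._store[key])

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


def as_bytes(value):
    # accept str or bytes; the backend always receives bytes
    return value.encode('utf-8') if isinstance(value, str) else value


backend = {}  # stand-in for a strict bytes store such as FileBytes
s = ValueCodecStore(backend, encode=as_bytes, decode=lambda b: b)

s['some_key'] = 'more_bytes'                 # write a str...
assert backend['some_key'] == b'more_bytes'  # ...the backend only ever sees bytes
assert s['some_key'] == b'more_bytes'        # ...and reads return bytes
```

This keeps the base store strict while letting any user opt in to leniency, which is exactly the separation the explicit design is meant to preserve.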
---
## Compare a few encoders (on size)

```python
import pickle, json, pandas as pd, numpy as np
import dol

texts = ['Some text that will be repeated many times.'] * 1000
vectors = [[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]] * 1000

pickle_encode = pickle.dumps
json_encode = json.dumps
numpy_encode = dol.written_bytes(np.save, obj_arg_position_in_writer=1)
parquet_encode = dol.written_bytes(pd.DataFrame.to_parquet, obj_arg_position_in_writer=0)

def single_column_parquet_encode(datas):
    return pd.DataFrame({'data': datas}).to_parquet()

print(pd.DataFrame([
    {'name': 'pickle', 'type': 'texts', 'size': len(pickle_encode(texts))},
    {'name': 'json', 'type': 'texts', 'size': len(json_encode(texts))},
    {'name': 'parquet', 'type': 'texts', 'size': len(parquet_encode(pd.DataFrame(texts)))},
    {'name': 'single_column_parquet_encode', 'type': 'texts', 'size': len(single_column_parquet_encode(texts))},
    {'name': 'numpy', 'type': 'texts', 'size': len(numpy_encode(np.array(texts)))},
    {'name': 'pickle', 'type': 'vectors', 'size': len(pickle_encode(vectors))},
    {'name': 'json', 'type': 'vectors', 'size': len(json_encode(vectors))},
    {'name': 'parquet', 'type': 'vectors', 'size': len(parquet_encode(pd.DataFrame(vectors)))},
    {'name': 'single_column_parquet_encode', 'type': 'vectors', 'size': len(single_column_parquet_encode(vectors))},
    {'name': 'numpy', 'type': 'vectors', 'size': len(numpy_encode(np.array(vectors)))},
]).sort_values('size').reset_index(drop=True).to_markdown(index=False))
```
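One caveat when reading these numbers: pickle memoizes repeated object references, so a list built with `* 1000` (the same object a thousand times) pickles far smaller than the same volume of distinct data would. A small stdlib-only check of that effect:

```python
import pickle, json

texts = ['Some text that will be repeated many times.'] * 1000
# The list holds the *same* str object 1000 times; pickle stores it once
# and emits short memo references for the repeats, while JSON writes out
# the full string every time.
assert len(pickle.dumps(texts)) < len(json.dumps(texts).encode('utf-8'))

# With 1000 distinct strings, pickle's memo no longer helps and its
# serialized size grows accordingly.
distinct = [f'Some distinct text number {i}.' for i in range(1000)]
assert len(pickle.dumps(distinct)) > len(pickle.dumps(texts))
```

Since both `texts` and `vectors` in the benchmark above are built with `* 1000`, the pickle sizes benefit from this memoization; a benchmark over distinct values may rank the encoders differently.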
---
To accumulate design notes.