# dol design notes #53
Replies: 2 comments
---
## Mutable Mapping Interface for Filesystems and Blob Storage - Data Type Flexibility

### Introduction

The design of a store raises a data-type question: should writes accept only the store's declared value type, or should compatible types be converted automatically on the way in? Consider the following examples:

```python
# Flexible bytes store
s = FileBytes(...)

# writing bytes, reading bytes
s['some_key'] = b'some_bytes'
assert s['some_key'] == b'some_bytes'

# writing strings, reading back bytes
s['another_key'] = 'more_bytes'  # note we're writing a string here...
assert s['another_key'] == b'more_bytes'  # ... but we get bytes back

# Flexible text store
t = FileText(...)

# writing strings, reading strings
t['text_key'] = 'some_text'
assert t['text_key'] == 'some_text'

# writing bytes, reading strings
t['byte_key'] = b'some_bytes'  # note we're writing bytes here...
assert t['byte_key'] == 'some_bytes'  # ... but we get a string back (decoded with UTF-8)
```

### Pros and Cons of Flexible Design
### Best Practices and Design Philosophy

### Is This Approach Used in Other Libraries?
### Decisions

After careful consideration, it is proposed that we adhere to the explicit design for the initial implementation. This approach offers a more robust foundation, mitigating the potential pitfalls of the hybrid solution. "Premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth. Users who require automatic conversions can achieve that behavior by wrapping their stores with value codecs. Should this pattern become prevalent, we can streamline it further by providing utility classes or factory methods. This choice gives us a more solid base and avoids the downsides of the hybrid solution before we have clear knowledge of how useful the flexibility would actually be.
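For users who do want automatic conversion, the value codec wrapper mentioned above is small to write. Here is a dependency-free sketch of the idea (dol itself ships store-wrapping utilities for this kind of thing, but the class and function names below are illustrative, not dol API):

```python
from collections.abc import MutableMapping


class ValueCodecStore(MutableMapping):
    """Wrap a strict backend store: encode values on write, decode on read."""

    def __init__(self, store, encode, decode):
        self._store = store    # the wrapped (strict) store
        self._encode = encode  # applied on __setitem__
        self._decode = decode  # applied on __getitem__

    def __setitem__(self, key, value):
        self._store[key] = self._encode(value)

    def __getitem__(self, key):
        return self._decode(self._store[key])

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


def as_bytes(value):
    # accept str or bytes; the backend always receives bytes
    return value.encode('utf-8') if isinstance(value, str) else value


backend = {}  # stand-in for a strict bytes store such as FileBytes
s = ValueCodecStore(backend, encode=as_bytes, decode=lambda b: b)

s['some_key'] = 'more_bytes'                 # write a str...
assert backend['some_key'] == b'more_bytes'  # ...the backend only ever sees bytes
assert s['some_key'] == b'more_bytes'        # ...and reads return bytes
```

This keeps the base store strict while letting any user opt in to leniency, which is exactly the separation the explicit design is meant to preserve.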
---
## Compare a few encoders (on size)

```python
import pickle, json, pandas as pd, numpy as np
import dol

texts = ['Some text that will be repeated many times.'] * 1000
vectors = [[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]] * 1000

pickle_encode = pickle.dumps
json_encode = json.dumps
numpy_encode = dol.written_bytes(np.save, obj_arg_position_in_writer=1)
parquet_encode = dol.written_bytes(pd.DataFrame.to_parquet, obj_arg_position_in_writer=0)

def single_column_parquet_encode(datas):
    return pd.DataFrame({'data': datas}).to_parquet()

print(pd.DataFrame([
    {'name': 'pickle', 'type': 'texts', 'size': len(pickle_encode(texts))},
    {'name': 'json', 'type': 'texts', 'size': len(json_encode(texts))},
    {'name': 'parquet', 'type': 'texts', 'size': len(parquet_encode(pd.DataFrame(texts)))},
    {'name': 'single_column_parquet_encode', 'type': 'texts', 'size': len(single_column_parquet_encode(texts))},
    {'name': 'numpy', 'type': 'texts', 'size': len(numpy_encode(np.array(texts)))},
    {'name': 'pickle', 'type': 'vectors', 'size': len(pickle_encode(vectors))},
    {'name': 'json', 'type': 'vectors', 'size': len(json_encode(vectors))},
    {'name': 'parquet', 'type': 'vectors', 'size': len(parquet_encode(pd.DataFrame(vectors)))},
    {'name': 'single_column_parquet_encode', 'type': 'vectors', 'size': len(single_column_parquet_encode(vectors))},
    {'name': 'numpy', 'type': 'vectors', 'size': len(numpy_encode(np.array(vectors)))},
]).sort_values('size').reset_index(drop=True).to_markdown(index=False))
```
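One caveat when reading these numbers: pickle memoizes repeated object references, so a list built with `* 1000` (the same object a thousand times) pickles far smaller than the same volume of distinct data would. A small stdlib-only check of that effect:

```python
import pickle, json

texts = ['Some text that will be repeated many times.'] * 1000
# The list holds the *same* str object 1000 times; pickle stores it once
# and emits short memo references for the repeats, while JSON writes out
# the full string every time.
assert len(pickle.dumps(texts)) < len(json.dumps(texts).encode('utf-8'))

# With 1000 distinct strings, pickle's memo no longer helps and its
# serialized size grows accordingly.
distinct = [f'Some distinct text number {i}.' for i in range(1000)]
assert len(pickle.dumps(distinct)) > len(pickle.dumps(texts))
```

Since both `texts` and `vectors` in the benchmark above are built with `* 1000`, the pickle sizes benefit from this memoization; a benchmark over distinct values may rank the encoders differently.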
---
To accumulate design notes.