[DATATYPE] Refactor data-types #2562

Open
7 of 8 tasks
blythed opened this issue Oct 25, 2024 · 3 comments

blythed added this to the 0.5.0 milestone on Oct 25, 2024
blythed (Collaborator, Author) commented Oct 28, 2024

  1. Artifact and File should not change depending on CFG.bytes_encoding, because both write to file.
  2. ...

jieguangzhou (Collaborator) commented Nov 4, 2024

The superduper-framework only provides the most basic datatypes:

  • Native
    • int
    • str
    • bytes
    • dict (json)
  • Vector
  • python_obj (default: pickle)

Plan 1

For different databackends, we can configure datatype conversion relationships, such as:

  • ibis:

    • dict → json
    • vector → sqlvector
  • postgres:

    • vector → pgvector

Thus, datatype conversions will occur in the following areas:

  • create_table
  • insert data
  • query data

Example: JSON

insert data

input_data = {"data": {"a": "b"}}
input_schema = Schema({"data": "dict"})

# Convert the framework schema to the databackend's schema.
schema = _convert_schema(input_schema)
# schema == Schema({"data": "json"})

encode_data = schema.encode(input_data)
# encode_data == {"data": '{"a": "b"}'}

query data

input_data = {"data": '{"a": "b"}'}
input_schema = Schema({"data": "dict"})

# Convert the framework schema to the databackend's schema.
schema = _convert_schema(input_schema)
# schema == Schema({"data": "json"})

decode_data = schema.decode(input_data)
# decode_data == {"data": {"a": "b"}}

Plan 2

Alternatively, we define the mapping in configuration files, and each preset datatype definition looks up its own replacement there.

datatypes:
  vector: ibis.datatype.sql_datatype
  # or vector: postgresql.datatype.sqldatatype
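  dict: json

class Vector:
    def __post_init__(self):
        # Datatype mapping loaded from the config (config key left unspecified).
        datatype_config = CFG.xxxx
        if self.__class__.__name__ in datatype_config:
            cls_real_datatype = ...  # import the real_datatype class
            self.real_datatype = cls_real_datatype(...)
        else:
            self.real_datatype = None

    def encode_data(self, item):
        # Delegate to the configured datatype when there is one.
        if self.real_datatype is not None:
            return self.real_datatype.encode_data(item)
        ...  # otherwise, Vector's own encoding

    def decode_data(self, item):
        if self.real_datatype is not None:
            return self.real_datatype.decode_data(item)
        ...  # otherwise, Vector's own decoding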
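
A runnable sketch of this delegation pattern, under stated assumptions: `DATATYPE_CONFIG` stands in for the config lookup, `JsonDatatype`, `BaseDatatype`, and `Dict` are hypothetical classes, and the config maps a class name directly to a class rather than to an import path. The fallback lives in separate `_default_*` methods so that a datatype with no configured replacement does not call itself.

```python
import json


class JsonDatatype:
    """Hypothetical backend-specific datatype that stores dicts as JSON strings."""

    def encode_data(self, item):
        return json.dumps(item)

    def decode_data(self, item):
        return json.loads(item)


# Stand-in for the datatype section of the config (assumption: class name -> class).
DATATYPE_CONFIG = {"Dict": JsonDatatype}


class BaseDatatype:
    def __init__(self, config=None):
        config = DATATYPE_CONFIG if config is None else config
        real_cls = config.get(type(self).__name__)
        # Use the configured replacement when there is one, else None.
        self.real_datatype = real_cls() if real_cls else None

    def encode_data(self, item):
        if self.real_datatype is not None:
            return self.real_datatype.encode_data(item)
        return self._default_encode(item)

    def decode_data(self, item):
        if self.real_datatype is not None:
            return self.real_datatype.decode_data(item)
        return self._default_decode(item)

    # Defaults are identity functions here; real datatypes would override them.
    def _default_encode(self, item):
        return item

    def _default_decode(self, item):
        return item


class Dict(BaseDatatype):
    pass


configured = Dict()
unconfigured = Dict(config={})
```

Here `configured` encodes through `JsonDatatype`, while `unconfigured` falls back to its own default methods.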

jieguangzhou (Collaborator) commented Nov 4, 2024

Regarding CFG.bytes_encoding, if it’s set to base64, then for all datatype.encode_data output:

If the data is bytes, convert it to base64 and add the prefix BASE64:.

Then, during decode_data:

If the input data is a str with the BASE64: prefix, convert it back to bytes.
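
The round trip described above might look like the following sketch. The helper names `encode_bytes` and `decode_bytes` and the `bytes_encoding` parameter are assumptions for illustration; only the `BASE64:` prefix and the base64 setting come from the comment itself.

```python
import base64

PREFIX = "BASE64:"


def encode_bytes(value, bytes_encoding="base64"):
    """Turn bytes into a prefixed base64 string when base64 encoding is enabled."""
    if bytes_encoding == "base64" and isinstance(value, bytes):
        return PREFIX + base64.b64encode(value).decode("ascii")
    return value


def decode_bytes(value):
    """Turn a prefixed base64 string back into bytes; leave other values alone."""
    if isinstance(value, str) and value.startswith(PREFIX):
        return base64.b64decode(value[len(PREFIX):])
    return value


encoded = encode_bytes(b"hello")
restored = decode_bytes(encoded)
```

One consequence of the prefix convention: an ordinary string that happens to start with `BASE64:` would be treated as encoded bytes on decode.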
