[DATATYPE] Refactor data-types #2562

blythed · 2024-10-25T09:56:45Z

The issue is that not every database supports every output type, and certain operations require the data to be in a specific format.

@kartik4949 @jieguangzhou to provide input.


blythed · 2024-10-28T09:31:34Z

Artifact and File should not change depending on CFG.bytes_encoding, because they write their data to files.
...

jieguangzhou · 2024-11-04T08:15:18Z

The superduper-framework only provides the most basic datatypes:

Native
- int
- str
- bytes
- dict (json)
- …
Vector
python_obj (default: pickle)

Plan 1

For different databackends, we can configure datatype conversion relationships, such as:

ibis:
- dict → json
- vector → sqlvector
postgres:
- vector → pgvector

Thus, datatype conversions will occur in the following areas:

create_table
insert data
query data

Example: JSON

insert data

input_data = {"data": {"a": "b"}}
input_schema = Schema({"data": "dict"})

schema = _convert_schema(input_schema)
# schema == Schema({"data": "json"})

encode_data = schema.encode(input_data)
# encode_data == {"data": '{"a": "b"}'}

query data

input_data = {"data": '{"a": "b"}'}
input_schema = Schema({"data": "dict"})

schema = _convert_schema(input_schema)
# schema == Schema({"data": "json"})

decode_data = schema.decode(input_data)
# decode_data == {"data": {"a": "b"}}
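The Plan 1 flow above can be sketched end to end in plain Python. This is only an illustration under assumptions: `BACKEND_CONVERSIONS`, `convert_schema`, `encode`, and `decode` are hypothetical names, schemas are modelled as plain dicts rather than the framework's `Schema` class, and only the `json` codec is filled in.

```python
import json

# Hypothetical per-backend conversion table mirroring the lists in Plan 1.
BACKEND_CONVERSIONS = {
    "ibis": {"dict": "json", "vector": "sqlvector"},
    "postgres": {"vector": "pgvector"},
}

# Hypothetical codecs keyed by storage datatype; only "json" is shown here.
ENCODERS = {"json": json.dumps}
DECODERS = {"json": json.loads}


def convert_schema(fields, backend):
    """Map framework datatype names to the backend's storage datatype names."""
    mapping = BACKEND_CONVERSIONS.get(backend, {})
    return {name: mapping.get(dtype, dtype) for name, dtype in fields.items()}


def encode(fields, row):
    """Encode each value with the codec of its storage datatype, if any."""
    return {k: ENCODERS.get(fields[k], lambda v: v)(v) for k, v in row.items()}


def decode(fields, row):
    """Invert encode(); values without a codec pass through unchanged."""
    return {k: DECODERS.get(fields[k], lambda v: v)(v) for k, v in row.items()}


schema = convert_schema({"data": "dict"}, backend="ibis")
encoded = encode(schema, {"data": {"a": "b"}})
decoded = decode(schema, encoded)
```

With this layout the conversion table is the only backend-specific piece, so the same three call sites (create_table, insert, query) can share one lookup.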

Plan 2

Alternatively, we define the mapping in a configuration file and resolve it inside the preset datatype definitions.

datatypes:
  vector: ibis.datatype.sql_datatype
  # or vector: postgresql.datatype.sqldatatype
  dict: json

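class Vector:
    def __post_init__(self):
        datatype_config = CFG.xxxx
        # Config keys are lowercase ("vector"), so normalise the class name.
        name = self.__class__.__name__.lower()
        if name in datatype_config:
            cls_real_datatype = ...  # import the real datatype class named by datatype_config[name]
            self.real_datatype = cls_real_datatype(...)
        else:
            self.real_datatype = None

    def encode_data(self, *args, **kwargs):
        # Delegate to the backend-specific datatype when one is configured.
        if self.real_datatype is not None:
            return self.real_datatype.encode_data(*args, **kwargs)
        ...  # otherwise use this datatype's own encoding

    def decode_data(self, *args, **kwargs):
        if self.real_datatype is not None:
            return self.real_datatype.decode_data(*args, **kwargs)
        ...  # otherwise use this datatype's own decoding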
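A runnable sketch of the Plan 2 "switch" idea may help make the delegation concrete. Everything here is an assumption for illustration: `DATATYPE_CONFIG` stands in for the configuration section, `SqlVector` and `REGISTRY` are hypothetical, and the comma-separated string format is invented purely to show a visible change of representation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical stand-in for the "datatypes" section of the configuration file.
DATATYPE_CONFIG = {"vector": "sqlvector"}


class SqlVector:
    """Hypothetical backend datatype: stores a vector as a comma-separated string."""

    def encode_data(self, item):
        return ",".join(repr(float(x)) for x in item)

    def decode_data(self, item):
        return [float(x) for x in item.split(",")]


# Hypothetical registry; a real implementation might import classes by dotted path.
REGISTRY = {"sqlvector": SqlVector}


@dataclass
class Vector:
    real_datatype: Optional[Any] = field(default=None, init=False)

    def __post_init__(self):
        # Look up the backend-specific datatype configured for this class, if any.
        target = DATATYPE_CONFIG.get(type(self).__name__.lower())
        if target is not None:
            self.real_datatype = REGISTRY[target]()

    def encode_data(self, item):
        if self.real_datatype is not None:
            return self.real_datatype.encode_data(item)
        return list(item)  # default: keep the vector as a plain list

    def decode_data(self, item):
        if self.real_datatype is not None:
            return self.real_datatype.decode_data(item)
        return list(item)


vec = Vector()
stored = vec.encode_data([1, 2.5])
restored = vec.decode_data(stored)
```

Checking `real_datatype is not None` explicitly, rather than calling the method on `self.real_datatype or self`, avoids the method calling itself when no override is configured.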

jieguangzhou · 2024-11-04T08:28:22Z

Regarding CFG.bytes_encoding: if it is set to base64, the following applies to the output of every datatype.encode_data:

If the data is bytes, convert it to base64 and add the prefix BASE64:.

Then, during decode_data:

If the input data is a str with the BASE64: prefix, convert it back to bytes.
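The prefix rule described above could look roughly like the following. This is a minimal sketch using only the standard library; `encode_bytes`, `decode_bytes`, and the `bytes_encoding` parameter are hypothetical names standing in for whatever hook the framework applies after `encode_data` and before `decode_data`.

```python
import base64

PREFIX = "BASE64:"


def encode_bytes(value, bytes_encoding="base64"):
    """Turn bytes into a prefixed base64 string; leave everything else untouched."""
    if bytes_encoding == "base64" and isinstance(value, bytes):
        return PREFIX + base64.b64encode(value).decode("ascii")
    return value


def decode_bytes(value):
    """Turn a prefixed base64 string back into bytes; leave everything else untouched."""
    if isinstance(value, str) and value.startswith(PREFIX):
        return base64.b64decode(value[len(PREFIX):])
    return value


encoded = encode_bytes(b"hello")
decoded = decode_bytes(encoded)
```

One consequence of this scheme is that an ordinary string that happens to start with `BASE64:` is indistinguishable from encoded bytes on the way back, so it would be decoded as bytes.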

blythed added this to the 0.5.0 milestone Oct 25, 2024

blythed added the epic label Nov 11, 2024

blythed added this to superduper-open-source Nov 14, 2024

blythed mentioned this issue Nov 19, 2024

[DATATYPE] Create base DataType without all parameters and use it for "switch" Vector type #2629

Closed

blythed modified the milestones: 0.5.0, 0.6.0 Nov 21, 2024

blythed removed this from superduper-open-source Nov 21, 2024

blythed modified the milestones: 0.6.0, 0.5.0 Nov 27, 2024

blythed added this to superduper-open-source Nov 27, 2024

blythed closed this as completed Feb 8, 2025

github-project-automation bot moved this to Done in superduper-open-source Feb 8, 2025
