
1.47.0

Latest
@github-actions github-actions released this 16 Jan 03:09

Merged PRs

dolt

  • 8749: Allow importing parquet fields containing repeated elements.
    NOTE: This still needs tests. I'm looking for a good tool for generating parquet. We can't use dolt table export to generate the parquet because we can't generate composite types that way.
    This PR adds support for importing specific composite parquet types into Dolt. Specifically, we're now able to import a composite parquet field if:
    • There is exactly one leaf column in the field.
    • There is at most one repeated tag in the field.
    We flatten these composite values into a single primitive value (if there are no repeated tags) or an array of primitive values (if there's exactly one repeated tag).
    There's more work to be done here (multidimensional arrays, objects, etc.), but this allows us to import vector embeddings stored in parquet files.

    Why do we flatten the type?

    We want to be able to import parquet files from HuggingFace, and store embedding sequences as arrays. Embedding sequences in HuggingFace exports are an optional field containing a single repeated child field, which itself contains a single optional field containing the sequence element. Flattening this into a single array is more usable and doesn't lose any data.
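As a rough illustration of the flattening described above, here is a minimal Go sketch. The `embeddingField` and `element` types are hypothetical stand-ins for the optional group, repeated child, and optional leaf; they are not Dolt's or any parquet library's actual types.

```go
package main

import "fmt"

// element is a hypothetical model of the innermost optional leaf.
// A nil Value means the element is absent (NULL).
type element struct {
	Value *float32
}

// embeddingField is a hypothetical model of the outer optional group,
// which holds a single repeated child.
type embeddingField struct {
	List []element
}

// flatten collapses the nested field into one slice. Pointers are kept so
// that absent elements survive as nil, and an absent field flattens to nil.
func flatten(f *embeddingField) []*float32 {
	if f == nil {
		return nil
	}
	out := make([]*float32, 0, len(f.List))
	for _, e := range f.List {
		out = append(out, e.Value)
	}
	return out
}

func main() {
	a, b := float32(0.25), float32(0.5)
	field := &embeddingField{List: []element{{Value: &a}, {Value: &b}}}
	for i, v := range flatten(field) {
		fmt.Println(i, *v)
	}
}
```

Keeping pointers in the flattened slice is one way to preserve missing elements, which is the sense in which the flattening does not lose data in this sketch.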
  • 8686: Proximity Map implementation with support for incremental edits.
    Based on #8408, now with additional functionality for incremental changes to indexes.
    This is a large-scale PR merging several features into main, all designed for supporting vector indexes.

    Vector Index Nodes

    1defec9 adds a new message/node type: the vector index node. This message stores a node in a Merkle tree index whose structure is based on some distance measure in a multi-dimensional space: at each level, keys are arranged such that a key is closer to its parent key than to any other key in the parent node.
    One consequence of this design is that it's not possible to put a hard limit on the number of keys contained in each node. We can control the mean node size, but there's always a non-zero chance that a node will be large enough to break our usual encoding scheme (which uses 16-bit ints to store message offsets). To address this, the vector index node uses 32-bit ints to store message offsets instead of the 16 bits used by other node types.
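To make the offset-width concern concrete, here is a small Go sketch. The layout (keys stored back to back, one starting offset per key) and the helper names are assumptions for illustration, not Dolt's actual message encoding.

```go
package main

import (
	"fmt"
	"math"
)

// keyOffsets returns the starting byte offset of each key, assuming the
// keys are laid out back to back in a single buffer.
func keyOffsets(keySizes []uint32) []uint32 {
	offsets := make([]uint32, len(keySizes))
	var pos uint32
	for i, size := range keySizes {
		offsets[i] = pos
		pos += size
	}
	return offsets
}

// fitsUint16 reports whether every offset can be stored in a 16-bit int.
func fitsUint16(offsets []uint32) bool {
	for _, off := range offsets {
		if off > math.MaxUint16 {
			return false
		}
	}
	return true
}

func main() {
	// 20 keys, each holding a 1024-dimensional float32 vector (4096 bytes).
	sizes := make([]uint32, 20)
	for i := range sizes {
		sizes[i] = 4096
	}
	offs := keyOffsets(sizes)
	fmt.Println(offs[len(offs)-1], fitsUint16(offs))
}
```

With keys this large, a node needs only a handful of entries before some offset exceeds 65535, which is why a node type with no hard cap on key count benefits from wider offsets.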

    Proximity Map

    A ProximityMap is a new implementation of Dolt's Map, a data structure built on Merkle trees that maps key bytestrings to value bytestrings. The ProximityMap is backed by a tree of vector index nodes, allowing it to perform an approximate nearest neighbor search.
    Proximity Maps resemble other Prolly Maps, but have the following invariants:
    • Each key must be convertible to a vector. Typically, the key is a val.Tuple, and the vector is the first value in that tuple.
    • The keys are arranged in the tree such that, for each of a key's parent keys (the keys that appear on the path from the root to the key), the key is closer to that parent key than to any of the parent key's siblings.
    • The keys in a node are sorted lexicographically (note that this is not necessarily the same ordering as the tuple that the key represents), except for the first key, which matches its direct parent.
    Notably, while the keys of an individual node are sorted, walking all of a vector index's keys in standard iteration order will not visit them in sorted order.
    28b7065 and 6b91635 contain the bulk of the ProximityMap implementation.
    The bulk of the changes are in the three commits named above. Each of the other commits is a smaller self-contained change necessary to support vector indexes.
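The distance invariant above can be sketched for a single level of the tree. This Go example assumes squared Euclidean distance and plain `[]float64` vectors; the function names are hypothetical, and the real ProximityMap works on encoded tuples rather than these types.

```go
package main

import "fmt"

// squaredL2 returns the squared Euclidean distance between two vectors.
// It assumes both vectors have the same length.
func squaredL2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// closestParent returns the index of the candidate nearest to key,
// or -1 if there are no candidates. Ties go to the lowest index.
func closestParent(key []float64, candidates [][]float64) int {
	best := -1
	var bestDist float64
	for i, c := range candidates {
		d := squaredL2(key, c)
		if best == -1 || d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

// satisfiesInvariant reports whether key is at least as close to
// siblings[parent] as to every other key in the same node.
func satisfiesInvariant(key []float64, siblings [][]float64, parent int) bool {
	dParent := squaredL2(key, siblings[parent])
	for i, s := range siblings {
		if i != parent && squaredL2(key, s) < dParent {
			return false
		}
	}
	return true
}

func main() {
	parents := [][]float64{{0, 0}, {10, 10}}
	key := []float64{9, 8}
	p := closestParent(key, parents)
	fmt.Println(p, satisfiesInvariant(key, parents, p))
}
```

Applying the same check at every level along the path from the root gives the full invariant. A nearest-neighbor search can then descend greedily toward the closest key at each level, which is what makes the lookup approximate rather than exact.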

go-mysql-server

  • 2817: Use vector index when the SELECT clause has a projection.
    Due to some overly strict pattern matching in the vector index selection, we weren't always using the index when there was a projection involved: we were only applying the index in the presence of a TopN node, but we also weren't generating TopN nodes in the case we had a Limit -> Project -> Sort node structure.
    I was hoping that dolthub/go-mysql-server#2813 would fix this, and I suspect there are improvements to GMS that would make this unnecessary. But for now, we should allow the pattern matching in replaceIdxOrderByDistance to apply a vector index lookup in this case.
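The shape of the relaxed pattern match can be sketched with a toy plan tree. All node types below are hypothetical stand-ins defined only for this example; they are not go-mysql-server's plan types, and `orderedLimit` is not the real replaceIdxOrderByDistance.

```go
package main

import "fmt"

// node is a toy plan node. Every type here is hypothetical.
type node any

type tableNode struct{ name string }
type sortNode struct {
	by string
	in node
}
type projectNode struct {
	cols []string
	in   node
}
type limitNode struct {
	n  int
	in node
}
type topNNode struct {
	n  int
	by string
	in node
}

// orderedLimit extracts the sort key and row limit when the plan is either
// a TopN node or a Limit -> Project -> Sort chain.
func orderedLimit(plan node) (by string, n int, ok bool) {
	switch p := plan.(type) {
	case topNNode:
		return p.by, p.n, true
	case limitNode:
		if proj, isProj := p.in.(projectNode); isProj {
			if s, isSort := proj.in.(sortNode); isSort {
				return s.by, p.n, true
			}
		}
	}
	return "", 0, false
}

func main() {
	plan := limitNode{n: 5, in: projectNode{
		cols: []string{"id"},
		in:   sortNode{by: "distance", in: tableNode{name: "docs"}},
	}}
	fmt.Println(orderedLimit(plan))
}
```

Matching only the `topNNode` case would miss the second shape entirely, which mirrors the situation the PR describes.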
  • 2816: Allow using vector index when the queried vector is provided in a user variable.
    Right now, vector indexes are very narrowly applied. One of the inputs to the DISTANCE function needs to be a constant. Previously, we required it to be a Literal expression, but UserVar expressions should also work.
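A minimal sketch of the widened check, using hypothetical expression types defined only for this example (they are not go-mysql-server's expression types):

```go
package main

import "fmt"

// expr is a toy expression. Every type here is hypothetical.
type expr any

type literal struct{ value []float64 }
type userVar struct{ name string }
type columnRef struct{ name string }

// isQueryConstant reports whether e has a fixed value for the duration of
// a single query, which is the property the index lookup relies on here.
func isQueryConstant(e expr) bool {
	switch e.(type) {
	case literal, userVar:
		return true
	}
	return false
}

func main() {
	fmt.Println(isQueryConstant(literal{value: []float64{1, 2}}))
	fmt.Println(isQueryConstant(userVar{name: "query_vec"}))
	fmt.Println(isQueryConstant(columnRef{name: "embedding"}))
}
```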
  • 2797: Persist and load superusers
    Previously, superusers were persisted to disk, but never loaded back again when the database was restarted. This essentially made all superusers ephemeral, since they only lasted for the duration of a SQL server process.
    This change loads persisted superusers from disk, and also adds a new function to create ephemeral superusers that do not get persisted to disk.
    This also includes a fix for the event scheduler to use a privileged account so that it can load events from all databases.
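The persisted versus ephemeral distinction might look roughly like the following Go sketch. The `superuser` and `userStore` types, the JSON format, and the function names are all assumptions for illustration; they do not reflect how go-mysql-server actually stores privileges.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// superuser is a hypothetical record. Ephemeral users are never written out.
type superuser struct {
	Name      string `json:"name"`
	Ephemeral bool   `json:"-"`
}

// userStore is a hypothetical in-memory collection of superusers.
type userStore struct {
	users []superuser
}

// persist serializes only the non-ephemeral superusers.
func (s *userStore) persist() ([]byte, error) {
	kept := make([]superuser, 0, len(s.users))
	for _, u := range s.users {
		if !u.Ephemeral {
			kept = append(kept, u)
		}
	}
	return json.Marshal(kept)
}

// load rebuilds a store from persisted bytes, the step that a restart
// needs in order for persisted superusers to survive.
func load(data []byte) (*userStore, error) {
	var users []superuser
	if err := json.Unmarshal(data, &users); err != nil {
		return nil, err
	}
	return &userStore{users: users}, nil
}

func main() {
	store := &userStore{users: []superuser{
		{Name: "root"},
		{Name: "temp-admin", Ephemeral: true},
	}}
	data, err := store.persist()
	if err != nil {
		panic(err)
	}
	restored, err := load(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(restored.users), restored.users[0].Name)
}
```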

Closed Issues

  • 8734: Can't delete remote branch refs that no longer exist in origin