Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: support all data types listed in scalar_kind_t in index_dense #469

Open
2 of 3 tasks
mbautin opened this issue Aug 22, 2024 · 1 comment
Open
2 of 3 tasks
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mbautin
Copy link
Contributor

mbautin commented Aug 22, 2024

Describe what you are looking for

index_dense seems to only support these data types:

    add_result_t add(vector_key_t key, b1x8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_b1x8); }
    add_result_t add(vector_key_t key, i8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_i8); }
    add_result_t add(vector_key_t key, f16_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f16); }
    add_result_t add(vector_key_t key, f32_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f32); }
    add_result_t add(vector_key_t key, f64_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f64); }

The casts are instantiated only for these 5 types as well.

scalar_kind_t has much more than that:

enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k = 1,
    u40_k = 2,
    uuid_k = 3,
    // Common:
    f64_k = 10,
    f32_k = 11,
    f16_k = 12,
    f8_k = 13,
    // Common Integral:
    u64_k = 14,
    u32_k = 15,
    u16_k = 16,
    u8_k = 17,
    i64_k = 20,
    i32_k = 21,
    i16_k = 22,
    i8_k = 23,
};

In particular, the SIFT1B dataset from http://corpus-texmex.irisa.fr/ seems to require unsigned char, and building a float-based index for that dataset seems wasteful.

If there is a good reason to only support those 5 data types, some documentation of the rationale would be great.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mbautin mbautin added the enhancement New feature or request label Aug 22, 2024
ashvardanian added a commit that referenced this issue Aug 29, 2024
Separating vector-casting logic will make it easier
to extend the type system, potentially adding `bf16`
and `u8` high-level `add`, `get`,  and`search` APIs
down the road.

Related to #469

Co-authored-by: Mikhail Bautin <[email protected]>
@ashvardanian
Copy link
Contributor

Some of those, indeed, make sense, @mbautin! Implementing all of them sounds like a code-bloat. How about adding only u8 and bf16 to casts_punned_t?

I've just pushed a commit that refactors the code logic making it easier to extend an API. Feel free to open PRs 🤗

@ashvardanian ashvardanian added good first issue Good for newcomers help wanted Extra attention is needed labels Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants