
implement dot_product, cosine_similarity/matrix for tensor ops; add shape mismatch e… #242

Open
wants to merge 10 commits into main

Conversation

Force1ess

feat: implement vector operations and error handling

Resolves #136

  • Add dot_product function for tensor multiplication
  • Implement cosine_similarity and cosine_distance metrics
  • Add ShapeMismatch error type for tensor dimension validation
  • Include comprehensive tests with edge cases for all operations

This supersedes previous work in #138 and #157

@Force1ess
Author

Hi @edgarriba,
I've implemented the optimized cosine_similarity function as described in the blog post and successfully reproduced comparable performance results. The benchmarks confirm significant speed improvements on my Mac M1 Pro, as shown in the attached screenshots.
One question: Since this optimized implementation doesn't use the dot_product function, should we remove that function from the codebase to reduce redundancy?

[benchmark screenshots]

/// # Errors
///
/// If the shapes of the tensors don't match, an error is returned.
pub fn cosine_similarity_optimized<T, const N: usize, A>(
Member

Great! So what do you think about providing a low-level kernel function like fn cosine_similarity_kernel_float<T>(a: &[T], b: &[T]) -> T? This would make it easy to have a very clean low-level API to expose to raw Python types (numpy comes with an overhead when passing data back and forth) and avoid exposing the whole Tensor API for now. We can do this in a future PR.

Author

Sounds great to me!

Author

By the way, should I put it in the 'tensor-ops' crate too?

Member

Yep, create a new mod kernels and place it there. This is my suggestion:

use num_traits::Zero;

fn dot_product1_float_kernel<T>(a: &[T], b: &[T]) -> T
where
    T: Zero + Copy + Clone + std::ops::Add<Output = T> + std::ops::Mul<Output = T>,
{
    // Accumulate the element-wise products into a running sum.
    let mut result = T::zero();
    for (a_val, b_val) in a.iter().zip(b.iter()) {
        result = result + *a_val * *b_val;
    }
    result
}

or

fn product1_float_kernel<T>(a: &[T], b: &[T]) -> T
where
    T: Zero + Copy + Clone + std::ops::Add<Output = T> + std::ops::Mul<Output = T>,
{
    let result = a
        .iter()
        .zip(b.iter())
        .fold(T::zero(), |acc, (a_val, b_val)| acc + *a_val * *b_val);
    result
}

Probably the second better leverages Rust internals to auto-vectorise where possible.
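
As a quick sanity check (a minimal sketch, assuming the two kernels above and num_traits are in scope), both variants should agree on a known dot product:

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    // 1*4 + 2*5 + 3*6 = 32
    assert_eq!(dot_product1_float_kernel(&a, &b), 32.0);
    assert_eq!(product1_float_kernel(&a, &b), 32.0);
}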

@edgarriba
Member


Results are amazing! I'm pretty sure we can even optimise a bit more later. I would like to keep the two functions, especially if we can make them very performant, as they are very useful for ML in general.

@Force1ess
Author

Force1ess commented Mar 8, 2025


Hi, I've finished the improvements we discussed.
Here are the newest benchmark results:
[benchmark screenshot]

@Force1ess
Author

By the way, I am pretty interested in optimizing these functions under your guidance.
If you have any ideas, please feel free to let me know.

@edgarriba
Member


Lots of things to do. Curating all these low-level kernels needed for ML could be an interesting direction, and then efficiently exposing them to Python.
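
To illustrate that direction (purely a hedged sketch, not code from this PR): a flat-slice kernel could be exposed to Python with PyO3 roughly as below. The module name kernels_py and the Vec<f32> conversion are assumptions, and the exact #[pymodule] signature depends on the PyO3 version.

use pyo3::prelude::*;

// Assumes dot_product1_float_kernel(a: &[f32], b: &[f32]) -> f32 from the suggestion above.
#[pyfunction]
fn dot_product(a: Vec<f32>, b: Vec<f32>) -> f32 {
    dot_product1_float_kernel(a.as_slice(), b.as_slice())
}

#[pymodule]
fn kernels_py(_py: Python, m: &PyModule) -> PyResult<()> {
    // Register the kernel wrapper as a plain Python function.
    m.add_function(wrap_pyfunction!(dot_product, m)?)?;
    Ok(())
}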


// Calculate cosine similarity: dot_product/(sqrt(magnitude_a)*sqrt(magnitude_b))
let denominator = magnitude_a.sqrt() * magnitude_b.sqrt();
Ok(dot_product / denominator)
}
Member

Same kernel strategy here:

fn cosine_similarity_kernel<T>(a: &[T], b: &[T]) -> T
where
    T: num_traits::Float,
{
    // Single pass: accumulate the dot product and both squared magnitudes together.
    let (dot_product, magnitude_a, magnitude_b) = a.iter().zip(b.iter()).fold(
        (T::zero(), T::zero(), T::zero()),
        |(dot_product, magnitude_a, magnitude_b), (a_val, b_val)| {
            let a = *a_val;
            let b = *b_val;
            (
                dot_product + a * b,
                magnitude_a + a * a,
                magnitude_b + b * b,
            )
        },
    );

    // Guard against zero-magnitude inputs to avoid division by zero.
    if magnitude_a == T::zero() || magnitude_b == T::zero() {
        return T::zero();
    }

    let denominator = magnitude_a.sqrt() * magnitude_b.sqrt();
    dot_product / denominator
}
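
As a quick check of the suggested kernel (again just a sketch, assuming the function above is in scope): orthogonal inputs should give 0 and identical inputs should give 1.

fn main() {
    let x = [1.0f32, 0.0];
    let y = [0.0f32, 1.0];
    // Orthogonal vectors -> similarity ~0; identical vectors -> similarity ~1.
    assert!(cosine_similarity_kernel(&x, &y).abs() < 1e-6);
    assert!((cosine_similarity_kernel(&x, &x) - 1.0).abs() < 1e-6);
}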

@Force1ess
Author

Hi @edgarriba,
I've relocated the dependency as per your guidance.
Is there anything else we should address in this PR?

@Force1ess
Author

Force1ess commented Mar 9, 2025


While I plan to work on implementing low-level kernels for machine learning, maybe I should focus on resolving GSoC-related issues first.

@edgarriba
Member


Definitely! But feel free to open a ticket if you think of other interesting things to be done later. Maybe others can help too. BTW, are you in the Discord?

@Force1ess
Author

Force1ess commented Mar 9, 2025


Of course, my nickname is forceless!
Also, please check if there's anything else we should complete in this PR.

@edgarriba
Member

Let's just move the internal functions into a kernel.rs and we're good to go.

@Force1ess
Author


Hi, I've implemented them and placed them in the kernel crate.

@edgarriba
Member

OK, then update tensor-ops to use the new kernel functions.
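
A rough sketch of what that wiring might look like (the Tensor accessors, the TensorAllocator bound, and the ShapeMismatch fields below are illustrative assumptions, not the PR's actual code):

// Hypothetical wrapper in tensor-ops delegating the numeric work to the kernel.
pub fn cosine_similarity<const N: usize, A: TensorAllocator>(
    a: &Tensor<f32, N, A>,
    b: &Tensor<f32, N, A>,
) -> Result<f32, TensorOpsError> {
    // Validate shapes up front and surface a ShapeMismatch error on disagreement.
    if a.shape != b.shape {
        return Err(TensorOpsError::ShapeMismatch {
            expected: a.shape.to_vec(),
            found: b.shape.to_vec(),
        });
    }
    // The flat-slice kernel does the actual computation.
    Ok(kernels::cosine_similarity_kernel(a.as_slice(), b.as_slice()))
}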

@edgarriba edgarriba left a comment

Check that the tests pass on your local machine.

@Force1ess
Author

@edgarriba, I've used the kernel functions and all tests passed on my machine.

[test output screenshots]


@@ -11,4 +12,12 @@ pub enum TensorOpsError {
/// Tensor error
#[error("Error with the tensor: {0}")]
TensorError(#[from] TensorError),

/// Kernel error
#[error("Error with the kernel: {0}")]
Member

Suggested change
#[error("Error with the kernel: {0}")]
#[error(transparent)]

I recently learnt this trick to forward the underlying error directly.
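
For reference, a minimal self-contained sketch of that trick with thiserror (the KernelError type here is a stand-in; the actual type and variant names in the PR may differ):

use thiserror::Error;

/// Stand-in for the kernels module's error type (name assumed).
#[derive(Error, Debug)]
#[error("kernel error: {0}")]
pub struct KernelError(pub String);

#[derive(Error, Debug)]
pub enum TensorOpsError {
    /// Forwarded as-is: Display and source come from the inner KernelError.
    #[error(transparent)]
    KernelError(#[from] KernelError),
}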

fn test_cosine_similarity_3d_tensors() -> Result<(), TensorOpsError> {
    let a = Tensor::<f32, 3, CpuAllocator>::from_shape_slice(
        [2, 2, 2],
        &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
Member

Now seeing this test, I think we should also limit this operator to Tensor1. I don't think it makes a lot of sense here, as it applies to the whole tensor instead of to a specific axis dimension.

Successfully merging this pull request may close these issues: Add cosine similarity metric