
Request for improved/updated beginner examples #378

Open
aeshirey opened this issue Feb 11, 2025 · 4 comments
@aeshirey

I'm looking to use linfa for k-means clustering, and the current k-means example is pretty incomprehensible to a newbie. It may be that this makes perfect sense to someone steeped in this API or even in ndarray, but to me, the issues are:

  • The current version of rand (0.9.0 as of Feb 2025) appears to be incompatible with the version used in the example
  • Generating random data from a PRNG doesn't help when my goal is to load data from somewhere else. How can I create a mutable data structure that I can push new vectors onto?
  • DatasetBase indicates that it contains records and maybe targets, weights, and feature names. I have no clue what the target/weights are when I'm trying to create input.
  • Not having expected centroids, I'd like to lean on the API to either generate something random, something evenly distributed, or use some quick heuristic otherwise.

Ultimately, my ideal is to do something like:

let mut records = Dataset::with_capacity(100_000); // expected number of input rows
for row in load_my_data("file.tsv") {
    // where 'row' is, say, a [f64; 5] or a Vec<f32>?
    records.push(row);
}

let initial_state = kmeans::generate_random_centroids(10 /* # clusters */, &records);

let clusters = kmeans::params_with(...).fit(&records);

for (id, cluster) in clusters.iter().enumerate() {
    // presumably cluster is [f64; 5] or &[f32]
    println!("Cluster {id} located @ {cluster:?}");
}

I realize this may diverge drastically from what currently exists, but I'd like to determine how to bridge this gap. Thanks!

@quietlychris
Member

I'm not sure that I fully understand what the ask here is, but will attempt to answer some of these one at a time:

The current version of rand (0.9.0 as of Feb 2025) appears to be incompatible with the version used in the example

Rust projects don't necessarily track the latest release of a dependency; instead they use the versions pinned in the Cargo.toml file, in this case rand = { version = "0.8", features = ["small_rng"] }. Additionally, as you've noted, the 0.9 release of rand is extremely recent (two weeks old), after 0.8 had been current for several years. Most of the ecosystem has not updated yet, and that will likely be a slow process, including for linfa.

Generating random data from a PRNG doesn't help when my goal is to load data from somewhere else. How can I create a mutable data structure that I can push new vectors onto?

You can take a look at some simple examples of how to load .csv files in linfa-datasets. While these are convenience methods, linfa is focused on machine learning algorithms, not on data manipulation. You may find it useful to do data manipulation directly via ndarray, or in polars and then use something like the to_ndarray method.
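As an illustration of the loading step (this is plain std Rust, not linfa or linfa-datasets API; `load_features` and `FEATURES` are hypothetical names), a minimal sketch that parses tab-separated rows into the flat, row-major buffer that, together with a `(rows, cols)` shape, is what `ndarray::Array2::from_shape_vec` expects:

```rust
// Hypothetical helper: not part of linfa. Parses tab-separated numeric
// rows into one flat, row-major Vec<f64>. That buffer plus the shape
// (rows, FEATURES) could then be handed to Array2::from_shape_vec.
const FEATURES: usize = 3;

fn load_features(tsv: &str) -> (usize, Vec<f64>) {
    let mut buf = Vec::new();
    let mut rows = 0;
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        // keep only the first FEATURES columns of each row
        for field in line.split('\t').take(FEATURES) {
            buf.push(field.trim().parse::<f64>().expect("numeric field"));
        }
        rows += 1;
    }
    (rows, buf)
}
```

In practice you would read the file with `std::fs::read_to_string` (or a buffered reader) and pass the result here; the point is only that "pushing rows" amounts to appending to one flat buffer.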

DatasetBase indicates that it contains records and maybe targets, weights, and feature names. I have no clue what the target/weights are when I'm trying to create input.

Most of this is pretty standard terminology within the data science/machine learning field, and there are a number of resources available across the internet that can help you with these terms. In order to stay tractable, linfa's own documentation generally doesn't seek to define each of these terms from the ground up, and assumes some level of familiarity with the subject matter. If you have specific suggestions on how to improve the clarity of the documentation on DatasetBase, we're certainly open to reviewing pull requests that pertain to that area.

Not having expected centroids, I'd like to lean on the API to either generate something random, something evenly distributed, or use some quick heuristic otherwise.

I'm not sure I follow here. Clustering algorithms, including K-Means, typically find clusters within a dataset, and those clusters are parameterized by centroids. linfa's API structure is fairly well-defined by this point; while we're open to good-faith discussion on how to make it more effective, it may be ultimately more efficient to create a wrapper library around the specific APIs you need while you're figuring those patterns out rather than directly modifying upstream.
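To make that point concrete, here is a minimal, dependency-free sketch of Lloyd's iteration (the core loop of k-means). It is illustrative only, not linfa's implementation: the centroids start from a crude seed (here simply the first k points) and are refined by the algorithm, so you don't need to know the "true" centroids up front.

```rust
// Illustrative Lloyd's-iteration k-means on 2-D points; not linfa code.
// Centroids are an *output* of fitting: seeded crudely, then refined.
fn kmeans(data: &[[f64; 2]], k: usize, iters: usize) -> Vec<[f64; 2]> {
    let mut centroids: Vec<[f64; 2]> = data[..k].to_vec(); // naive seed
    for _ in 0..iters {
        let mut sums = vec![[0.0f64; 2]; k];
        let mut counts = vec![0usize; k];
        for p in data {
            // assign each point to its nearest current centroid
            let (best, _) = centroids.iter().enumerate().fold(
                (0, f64::MAX),
                |(bi, bd), (i, c)| {
                    let d = (p[0] - c[0]).powi(2) + (p[1] - c[1]).powi(2);
                    if d < bd { (i, d) } else { (bi, bd) }
                },
            );
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best] += 1;
        }
        // move each centroid to the mean of its assigned points
        for i in 0..k {
            if counts[i] > 0 {
                centroids[i] = [sums[i][0] / counts[i] as f64,
                                sums[i][1] / counts[i] as f64];
            }
        }
    }
    centroids
}
```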

I realize this may diverge drastically from what currently exists, but I'd like to determine how to bridge this gap. Thanks!

As mentioned, we're happy to review good-faith contributions from the greater community. In order to be most effective at this, I'd suggest taking a few minutes to review the contributing guide.

@bremby

bremby commented Feb 13, 2025

@quietlychris I'm another newbie and I'm wondering something similar, but regarding the OPTICS algorithm. The documentation for that is extremely sparse, to the point that I have to look at the source code to figure out what I'm supposed to do. Please don't take it personally or the wrong way; I am also posting this in good faith. But I've been looking at OPTICS since yesterday and I'm still unsure. In particular, I don't know why the implementation expects Array2 when I'm supplying my own distance function. I'm not a Rust expert either, by the way. I'm looking at the transform method for OpticsValidParams, and it specifies F to be Float. It goes even deeper: the default nn_algo is KdTree, and the KdTreeIndex::new() method also requires the data element to be Float for no obvious reason (not obvious to me, at least).
What I have is a Vec of structs representing 2D points on a sphere. Yes, they are 2D, so it looks like I shouldn't have a problem, but I'm spending a lot of time figuring out how to convert these representations into what ndarray understands as Array2 (and that would be a gripe with the ndarray library, not linfa) just to satisfy linfa's requirement. It would seem that I should be able to have any type as input and just supply a distance function (which I have to do here anyway, because distances on a globe are not simply planar Euclidean). I'm probably missing something, but isn't that all the theoretical description of OPTICS requires?

@quietlychris
Member

As a suggestion, it's generally easier for someone to help with something like this by referencing a piece of actual code, rather than having to back out your particular situation from prose.

Especially I don't know why the implementation expects Array2 when I'm supplying my own distance function.

If you could create and link a repository with the code you have, and link to the lines that you're having trouble with, that would be helpful. For example, I'm not exactly sure where you're trying to pass in your own distance function, or what that looks like. This is, in general, good practice, and:

  • Allows developers to pinpoint specific issues that you might be having, and limits the likelihood that people are misinterpreting one another
  • Helps to demonstrate that you've spent some time attempting to work through the problem on your own
  • Creates a reference in the future that other folks could potentially use, and/or creates a starting place for building out examples/writing the sort of documentation that you're interested in

What I have is a Vec of structs representing 2D points on a sphere. Yes, they are 2D so it looks like I shouldn't have a problem, but I'm spending a lot of time figuring out a way to convert these different representations into what ndarray understands as Array2

If you have something like Point { x: f32, y: f32 }, and are currently representing your data as Vec<Point>, you'll almost certainly need to write a function for converting that into ndarray::Array2<f32>. You can probably compare the way that both linfa_datasets::generate_blobs() and the OPTICS test functions (marked with #[test], starting around here) are managing this problem to see if that helps.
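A minimal sketch of that conversion, using the hypothetical Point struct from above (plain std Rust; the flat buffer it produces is what `ndarray::Array2::from_shape_vec((n, 2), buf)` would consume, row-major, giving one sample per row):

```rust
// `Point` mirrors the hypothetical struct discussed above.
#[derive(Clone, Copy)]
struct Point { x: f32, y: f32 }

// Flattens points row-major: each point becomes one row of an (n, 2)
// matrix. The returned (n, buf) pair is what you'd hand to
// ndarray::Array2::from_shape_vec((n, 2), buf).
fn points_to_flat(points: &[Point]) -> (usize, Vec<f32>) {
    let n = points.len();
    let buf = points.iter().flat_map(|p| [p.x, p.y]).collect();
    (n, buf)
}
```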

@bremby

bremby commented Feb 14, 2025

@quietlychris Thanks for the reply. I didn't really mean to share actual code, because I didn't want to bother others with trying to help me debug it. I only wanted to support OP by giving my own perspective. I wanted to stress how difficult it is to get into using this library (I'm not saying it's bad, it's just difficult to get going).

Anyway, if your offer to help stands, here's the code I'm currently struggling with. I've already managed to translate the Vec into Array2, that's fine. But I struggle with the dist function.

...
    // from_shape_vec fills the array row-by-row (standard layout),
    // so each point becomes one row of an (n, 2) matrix
    let n = original_points.len();
    let dataset: Array2<f64> = Array2::from_shape_vec(
        (n, 2),
        original_points.into_iter().flat_map(|p| [p.x, p.y]).collect::<Vec<_>>(),
    ).unwrap();

    // this is from the optics example from linfa
    // Configure our training algorithm
    let min_points = 3;
    let analysis = Optics::params_with(min_points, GlobeDist, CommonNearestNeighbour::KdTree)
        .tolerance(3.0)
        .transform(dataset.view())
        .unwrap();

    println!("Result: ");
    for sample in analysis.iter() {
        println!("{:?}", sample);
    }
}

// and this is my own
fn globe_distance(_: f64, _: f64, _: f64, _: f64) -> f64 {
    unimplemented!()
}

#[derive(Clone)]
struct GlobeDist;

impl Distance<f64> for GlobeDist {
    fn distance<D: Dimension>(&self, a: ArrayView<f64, D>, b: ArrayView<f64, Ix2>) -> f64 {
        globe_distance(a[[0,1]], a[[0,0]], b[[0,1]], b[[0,0]])
    }
}

I'm struggling with the type parameter D and Ix2. Notice that parameters a and b have different type signatures: the first one just uses D, the other one Ix2. The first one causes an error on the first two parameters of globe_distance(), and the latter is incompatible with D. I looked at how L2Dist is implemented, but that one uses some l2_dist method on the ArrayView. That method is not within linfa nor ndarray, and I have no idea where it comes from or what it looks like.

[image attachment]

I wanted to provide a MWE, but I don't even know how to compile your optics example and I can't share my codebase (proprietary). I'm hoping I won't need to set up an entire new project just to produce a MWE.
Apologies for being inexperienced.

Edit:
I spent more time trying to understand what's going on. If I understand correctly, the GlobeDist struct must implement Distance, but that trait requires me to implement the distance<D: Dimension> function for any Dimension. That doesn't seem possible, because Rust doesn't support type specialization, so I can't implement distance<Ix2> and distance<anything but Ix2>. The only way I could do it would be to use iterators over rows within the generic ArrayBase struct: calculate correctly when the row length equals 2, and fail/panic on everything else. Is this the only way forward?

Edit2: In case someone finds this, the previous paragraph is correct. I was able to implement a generic GlobeDist by iterating over the individual elements of the first (and only) row in the ArrayViews. It panics if there are fewer than two items and ignores any after the first two.
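For anyone landing here later, a sketch of the distance computation itself as a plain function over slices: the haversine central angle (in radians, on a unit sphere). Inside a generic `Distance::distance` impl you would feed it the first elements of each `ArrayView` via `.iter()`, as described above. The function name and the (lat, lon) ordering are assumptions, not anything from linfa:

```rust
// Haversine central angle (radians, unit sphere) between two (lat, lon)
// points given in radians. Panics on fewer than two elements and ignores
// any extras, mirroring the generic-Distance approach described above.
fn globe_distance(a: &[f64], b: &[f64]) -> f64 {
    assert!(a.len() >= 2 && b.len() >= 2, "expected (lat, lon) pairs");
    let (lat1, lon1) = (a[0], a[1]);
    let (lat2, lon2) = (b[0], b[1]);
    let dlat = lat2 - lat1;
    let dlon = lon2 - lon1;
    // haversine: h = sin^2(dlat/2) + cos(lat1) cos(lat2) sin^2(dlon/2)
    let h = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    2.0 * h.sqrt().atan2((1.0 - h).sqrt())
}
```

Multiply the result by the sphere's radius (e.g. Earth's mean radius) to get a distance rather than an angle.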
