
NaN value of features #7

Open
akharroubi opened this issue Jun 17, 2023 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@akharroubi

For NaN values generated by CloudCompare (when choosing a fixed radius), I see two possible solutions:

  1. Filter these values out before reading the file, or interpolate them from neighboring points; otherwise, classify only the points with valid features and interpolate the classification afterward (a sketch of this follows the list).

  2. Or, if there are no points within a radius r, fall back to computing the features from the nearest neighbors.
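A minimal sketch of the first option, assuming features and coordinates are aligned NumPy arrays and some already-fitted classifier; the function name and the nearest-neighbor label transfer are illustrative, not the project's actual code:

```python
import numpy as np
from scipy.spatial import cKDTree


def classify_with_nan_fallback(xyz, features, classifier):
    """Classify points with complete features, then transfer labels to
    NaN-feature points from their nearest classified neighbor.

    xyz: (N, 3) coordinates; features: (N, F), may contain NaN rows;
    classifier: any fitted estimator with a predict() method.
    """
    valid = ~np.isnan(features).any(axis=1)

    labels = np.empty(len(xyz), dtype=int)
    labels[valid] = classifier.predict(features[valid])

    # Look up the nearest classified point for each NaN-feature point.
    tree = cKDTree(xyz[valid])
    _, nearest = tree.query(xyz[~valid], k=1)
    labels[~valid] = labels[valid][nearest]
    return labels
```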

@Yarroudh Yarroudh added the enhancement New feature or request label Jun 19, 2023
@Yarroudh Yarroudh self-assigned this Jun 19, 2023
@Yarroudh
Owner

I'll be working on that this week. Thanks @akharroubi.

@Yarroudh
Owner

Yarroudh commented Jul 21, 2023

I've been exploring the missing values in the RF classifier and I think there are some options:

  • Completely drop NaN values and train the model (not recommended).
  • Fill in the missing values with the median, mean, or mode.
  • Estimate missing features using the nearest samples.

In scikit-learn, there is a class sklearn.impute.SimpleImputer that replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value. There is also sklearn.impute.KNNImputer, which completes missing values using k-Nearest Neighbors.
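A quick sketch of both imputers on a toy feature matrix (the array values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Feature matrix with NaN entries (e.g. features CloudCompare could not
# compute within the fixed radius).
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Column-wise descriptive statistic: here, the median of each feature.
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# k-Nearest Neighbors imputation: each missing value is filled with the
# mean of that feature over the k nearest samples that have it present.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)
```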

I'm also working on resolving memory saturation on large datasets. For reading the data, I'm now using chunked reading as implemented in laspy. For training the model, I think batch learning can be useful. As explained here, RandomForestClassifier has a parameter warm_start: if it's set to True, the classifier reuses the solution of the previous call to fit and adds more estimators to the ensemble; otherwise, it just fits a whole new forest.
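A minimal sketch of how the two could combine, assuming a classified LAS file and using only x/y/z as features for brevity (file name, chunk size, trees per batch, and the feature set are placeholders; note that with this warm_start pattern each added tree is trained only on its own chunk):

```python
import laspy
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Grow the forest incrementally: warm_start=True keeps the trees fitted
# so far, and raising n_estimators before each fit() adds new trees
# trained on the current chunk.
clf = RandomForestClassifier(n_estimators=0, warm_start=True)

with laspy.open("pointcloud.las") as reader:
    # chunk_iterator yields points in blocks instead of loading the
    # whole file into memory at once.
    for points in reader.chunk_iterator(1_000_000):
        X = np.column_stack([points.x, points.y, points.z])
        y = np.asarray(points.classification)

        clf.n_estimators += 10
        clf.fit(X, y)
```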
