
Robust loss functions for outlier rejection #332

Closed
Affie opened this issue Dec 15, 2023 · 17 comments

@Affie
Contributor

Affie commented Dec 15, 2023

Is it possible to use loss functions such as Huber, Tukey, and adaptive losses in Manopt.jl (specifically with RLM)?
Similar to Ceres Solver: http://ceres-solver.org/nnls_modeling.html#lossfunction
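For concreteness, this is the kind of loss I mean (rough sketch in plain Julia, hypothetical helpers, not an existing Manopt.jl interface); following Ceres, the loss acts on the squared residual norm, and δ is an illustrative scale parameter:

```julia
# Huber-type loss applied to the squared residual norm x = ‖f(p)‖²,
# together with its derivative (needed for the chain rule in the Jacobian).
# Hypothetical sketch, not Manopt.jl API.
huber(x; δ=1.0) = x ≤ δ^2 ? x : 2δ * sqrt(x) - δ^2
huber_prime(x; δ=1.0) = x ≤ δ^2 ? one(x) : δ / sqrt(x)
```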

@mateuszbaran
Member

Not at the moment but it looks like a relatively simple modification of RLM to support it. The rescaling described in http://ceres-solver.org/nnls_modeling.html#theory looks like it would work for RLM too.
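If I read the Ceres theory section correctly, the robustified problem is (up to constants)

$$\min_{p} \ \frac{1}{2}\sum_{i} \rho_i\bigl(\lVert f_i(p)\rVert^2\bigr),$$

and, as long as the terms involving $\rho_i''$ are ignored, the rescaling

$$\tilde f_i = \sqrt{\rho_i'}\, f_i, \qquad \tilde J_i = \sqrt{\rho_i'}\, J_i$$

reproduces the gradient $\rho_i' J_i^\top f_i$ of each term, so a standard (R)LM step on the rescaled residuals and Jacobians should approximate the robustified problem.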

@mateuszbaran
Member

@kellertuer do you think it would be better to add robustification support to the current implementation of RLM or make a separate implementation?

@kellertuer
Member

kellertuer commented Dec 15, 2023

In the Exact Penalty Method (EPM), this is included in the method itself, so I think that would be fitting here as well. We even have types for different kinds of relaxations already (e.g. Huber).

If we extend those and use them in RLM as well, I think that would be great.

edit: to provide a link, we currently have https://manoptjl.org/stable/solvers/exact_penalty_method/#Manopt.SmoothingTechnique – those could either be extended or combined into a common framework of robustification / smoothing.

@mateuszbaran
Member

Cool 👍 . RLM seems to need a somewhat different interface though.

@kellertuer
Member

Sure, no problem – maybe we can also revise the interface with EPM a bit to have a common one then.

@mateuszbaran mentioned this issue May 5, 2024
@kellertuer
Member

I revisited this and for the most general case (outside of RLM), I have no real idea how that could be done or what that would even mean for an algorithm like Douglas-Rachford for example.

Within RLM, one could really just store the ρ function mentioned in the docs linked above and apply that element-wise.
It would probably be best to have that as a HessianObjective (though a bit verbose for a 1D function) to store its derivative and second derivative as well, since the Jacobian is affected by this choice through a chain rule as well.

Ah, I am not so sure we need the second derivative for now? We only use first-order information in RLM for now, I think. Then, besides that field, only the get_cost and get_jacobian! for the NonlinearLeastSquaresObjective have to be adapted.
And probably a GradientObjective (a function and its derivative in the 1D case) is enough flexibility.
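A very rough sketch of what such a “function plus derivative” field could look like (illustrative names only, not existing Manopt.jl types):

```julia
# Store the smoothing/robustifying function ρ together with its derivative,
# so both the robustified cost and the rescaled Jacobian can be evaluated.
# Purely illustrative, not an actual Manopt.jl type.
struct RobustifierSketch{F,DF}
    ρ::F    # scalar loss applied to the squared residual norm
    dρ::DF  # its derivative, used in the chain rule for the Jacobian
end

robust_cost(r::RobustifierSketch, x) = r.ρ(x)          # x = ‖f(p)‖²
robust_scale(r::RobustifierSketch, x) = sqrt(r.dρ(x))  # √ρ′ rescaling factor
```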

@mateuszbaran
Member

> I revisited this and for the most general case (outside of RLM), I have no real idea how that could be done or what that would even mean for an algorithm like Douglas-Rachford for example.

I don't think it can work for every loss function, only those without splitting, nonlinear least squares, and most likely stochastic optimization, though I couldn't find any papers about what the Euclidean variant would be. Maybe let's tackle each case separately?

> Within RLM, one could really just store the ρ function mentioned in the docs linked above and apply that element-wise.
> It would probably be best to have that as a HessianObjective (though a bit verbose for a 1D function) to store its derivative and second derivative as well, since the Jacobian is affected by this choice through a chain rule as well.
>
> Ah, I am not so sure we need the second derivative for now? We only use first-order information in RLM for now, I think. Then, besides that field, only the get_cost and get_jacobian! for the NonlinearLeastSquaresObjective have to be adapted.
> And probably a GradientObjective (a function and its derivative in the 1D case) is enough flexibility.

As far as I can tell, ρ is not applied elementwise to the output of f but only after the squared norm is computed. Also, maybe it would be most natural to store ρ in NonlinearLeastSquaresObjective.

In an earlier post you had the idea of combining robust loss functions with SmoothingTechnique, maybe we can do that instead of representing it as a HessianObjective?

@kellertuer
Member

You are right, in full generality that might not be possible, so let's just do it for RLM for now.

Yes, storing $\rho$ in the NLSObj would be my idea for now as well, but we have to store its derivative as well, so that is why I was thinking about storing it as a gradient objective to stay generic.

Yes, the smoothing techniques are basically the same, though there we handle that by storing a symbol (and only support 2 functions). I am not yet sure how well this can be combined, but it would most probably be the EPM being adapted to the approach here (since its symbol-based approach is far more restrictive than storing the smoothing function).
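To illustrate the difference between the two designs (hypothetical code, not the actual implementation): a symbol-based approach can only dispatch on a fixed list of smoothings, while storing the function itself admits arbitrary losses.

```julia
# Symbol-based selection: only the hard-coded cases are available.
smoothing(::Val{:huber}, x, δ) = x ≤ δ^2 ? x : 2δ * sqrt(x) - δ^2

# Function-based storage: any user-provided loss works, for example
my_loss(x) = log1p(x)            # a Cauchy-type loss, just as an example
apply_smoothing(s, x) = s(x)
apply_smoothing(my_loss, 4.0)    # returns log(5.0)
```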

@kellertuer
Member

kellertuer commented Dec 27, 2024

Hi,
just about a year after this was opened, a small update: The more I read the docs above, the less I like them ;) For example they have a scaling where their smoothing variable is $a^2\rho(x/a^2)$ instead of just $x$ (why ever they use the $^2$), which of course has an effect on the derivatives. And they have a scaled smoothing as $a\rho(x)$. Sure, they use the same name for both ...

One thing I am not yet so sure about is whether we need the second derivative in our tangent-space-Jacobian-model subproblem; I will have to think about that. Maybe we do not.
Ceres does not mention what they do, but they seem to only use an approximate Hessian, since they write “where the terms involving the second derivatives of $f(x)$ have been ignored.” – while we seem to solve a regularised Jacobian problem as a subproblem, which does not require the second derivative? I have to check that a bit closer.

For now I think, since we solve the subproblem (2.1) from https://arxiv.org/pdf/2210.00253, this fixes a Jacobian and does not involve a derivative of the Jacobian. In the Theory section of the Ceres docs that would just be the gradient (since they consider a single function) and then they introduce an approximate Hessian. Our approach, instead, basically approximates the Hessian implicitly, and hence does not require $\rho''$. But I will check and document this carefully.
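To make that concrete (roughly, and up to the exact notation in the paper), the subproblem fixes $J_k$ and solves a regularised linear least-squares model in the tangent space,

$$\min_{v \in T_{p_k}\mathcal M} \ \frac{1}{2}\bigl\lVert F(p_k) + J_k[v] \bigr\rVert^2 + \frac{\lambda_k}{2}\lVert v \rVert^2,$$

which involves only the Jacobian and not its derivative, so no second derivative of the smoothing should enter at this point.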

For now (yesterday that is) I just worked a bit on generalising the vector functions we have to be able to do LM with them. Then the chain rule with the smoothing should not be that complicated. Will open a PR once this seems to work or once I am stuck somewhere.
So I think all necessary tools should already be available; one “just” has to adapt the get_jacobian function of the nonlinear least squares objective.

@mateuszbaran
Member

mateuszbaran commented Dec 27, 2024

> just about a year after this was opened, a small update: The more I read the docs above, the less I like them ;) For example they have a scaling where their smoothing variable is $a^2\rho(x/a^2)$ instead of just $x$ (why ever they use the $^2$), which of course has an effect on the derivatives. And they have a scaled smoothing as $a\rho(x)$. Sure, they use the same name for both ...

> One thing I am not yet so sure about is whether we need the second derivative in our tangent-space-Jacobian-model subproblem; I will have to think about that. Maybe we do not. Ceres does not mention what they do, but they seem to only use an approximate Hessian, since they write “where the terms involving the second derivatives of $f(x)$ have been ignored.” – while we seem to solve a regularised Jacobian problem as a subproblem, which does not require the second derivative? I have to check that a bit closer.

Scaling by $a^2\rho(x/a^2)$ makes sense to me because they want the factor $a$ to have the same units as elements of the residual $f(x)$, and they want the rescaled function to have the same order of magnitude as the original one for small residuals.
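Concretely, if (as for the Ceres losses) $\rho(0)=0$ and $\rho'(0)=1$, a first-order expansion gives

$$a^2\rho\!\left(\frac{x}{a^2}\right) = x + O\!\left(\frac{x^2}{a^2}\right),$$

so for small squared residuals $x = \lVert f(p)\rVert^2$ the rescaled loss matches the plain least-squares term, and $a$ has the same units as $f$.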

> For now I think, since we solve the subproblem (2.1) from https://arxiv.org/pdf/2210.00253, this fixes a Jacobian and does not involve a derivative of the Jacobian. In the Theory section of the Ceres docs that would just be the gradient (since they consider a single function) and then they introduce an approximate Hessian. Our approach, instead, basically approximates the Hessian implicitly, and hence does not require $\rho''$. But I will check and document this carefully.

> For now (yesterday that is) I just worked a bit on generalising the vector functions we have to be able to do LM with them. Then the chain rule with the smoothing should not be that complicated. Will open a PR once this seems to work or once I am stuck somewhere. So I think all necessary tools should already be available; one “just” has to adapt the get_jacobian function of the nonlinear least squares objective.

I think Triggs' paper is a bit clearer than the Ceres docs when it comes to the use of the Hessian. It won't modify the structure of the subproblem (2.1), but it will modify what we take as $J_k$ based on the second derivative of $\rho$.
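Writing it out in the Euclidean case (and dropping the second derivatives of $f$, as both Ceres and Triggs do), the model Hessian of $\tfrac12\rho(\lVert f\rVert^2)$ becomes

$$J^\top\bigl(\rho'\, I + 2\rho''\, f f^\top\bigr) J,$$

so the $\rho''$ term is exactly what a corrected $J_k$ would have to account for, while dropping it reduces to the plain $\sqrt{\rho'}$ rescaling of residuals and Jacobian.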

@kellertuer
Member

Sure, I can do the scaling by $1/a^2$ as well (currently I just do $1/a$), that is no problem.

And yes, Triggs' paper is my next go-to. Hopefully I understand it a bit better then.
Sure, I have already moved most features of LM to VectorFunctions; the next step is to include the smoothing function in the objective's Jacobian as well. That is where I still have to read a bit about this.
Just that today the 38c3 starts, so I am watching talks for most of today probably.

(and I will call $\rho$ just $s$ to avoid visual confusion with $p$.)

@kellertuer
Member

For me, Triggs stays as vague as the Ceres docs at the end of Sec. 4.3 (Eq. 10 and the following text). Understanding $H_i$ better is what I still have to do, because it seems they both only approximate it and only state it rather than derive it. So I probably have to derive it for our case myself to actually understand what is happening (besides a chain rule); especially why their Hessian only includes a Jacobian and not a derivative of the Jacobian as well.

@mateuszbaran
Member

The derivations were left as an exercise to the reader 😃. Some derivations are standard for Gauss-Newton so you can look for that if you want to find a more detailed description. I can check it if you have trouble figuring out some part.

@kellertuer
Member

Yeah, and the references are left to the reader's imagination as well.

I think my main “mismatch” currently is that we use J*J (while Triggs has some other J) and Triggs uses the Hessian, which we do not use. So to some extent the methods are quite different, at least in their derivation. And I am too lazy (or stupid) to derive the method with the Riemannian Hessian of the chain rule $s_i(f_i(p))$; that is where I got stuck after not too long and did not yet continue.

So I do understand the derivation of Triggs – sure. It is just that our model approximation in the tangent space seems to be a different one – and I am not able to match the two.
I think I could still implement most of it; my main problem is that the chain rule also requires a new $\lambda_k$ in front of the identity matrix – and that I cannot see for now. My fear is that if I were to write it down carefully, it would nearly be a small paper ;) – I would be open to writing such a paper, but maybe we should finish other projects first.

@mateuszbaran
Member

I guess we don't need the Riemannian Hessian just as the Riemannian LM paper doesn't use one, though then we have to derive the corrections from scratch.

And given that Ceres also works with box constraints, this starts to look like material for a slightly larger paper 😉 . That's something I'd like to work on, but first we need to finish the Kalman filter one.

@kellertuer
Member

That is what I mean, let's first do the Kalman project ;)

I worked a bit on the code today, and I think I am nearly done with a rework that still uses a slightly wrong $\lambda_k$, but the rest should soon(tm) be done.

@kellertuer
Member

While #432 did now rework LM to work much more nicely with vectorial functions, and hence allows more ways to provide the first-order information, it will not resolve this issue:

  • it does only single-element smoothing, not block smoothing, which is too specific;
  • the required rescaling due to Triggs is missing, since on manifolds that theory does not exist.

So this issue had a nice side effect of some restructuring.
But for now, especially due to the second point, I would answer the original question as follows:

No. There does not exist enough theory to use robust loss functions in Manopt.jl. If it is ok, I would prefer to close this issue, since without the theory, there is no real perspective on this being doable.
