
# Descent Methods

## Descent Methods

Consider the unconstrained minimization problem. The general iteration formula in Descent Methods is $x_{k+1} = x_k + t_k \Delta x_k$, where $t_k$ is the step size and $\Delta x_k$ is the step direction at iteration $k$. The outline of a general Descent Method is as follows.

given a starting point $x$.
repeat
1. Determine a descent direction $\Delta x$.
2. Line search. Choose a step size $t > 0$.
3. Update. $x := x + t \Delta x$.

until a stopping criterion is satisfied.
Different algorithms differ with respect to the choice of:

- Descent direction $\Delta x_k$:
  - Steepest Descent Method
  - Newton Method
- Step size $t_k$:
  - Fixed Step Size
  - Exact Line Search
  - Backtracking Line Search

For $\Delta x$ to be a descent direction, a necessary condition is $\nabla f(x)^T \Delta x < 0$.
- Steepest Descent Method: $\Delta x = - P^{-1} \nabla f(x)$
  - Choosing a different norm (a different symmetric positive definite matrix $P$) gives a different descent direction.
- Gradient Descent: $\Delta x = - \nabla f(x)$
  - is the special case of the steepest descent method in which $P$ is the identity matrix;
  - is the steepest direction with respect to the first-order approximation.
- Newton Method: $\Delta x = - \nabla^2 f(x)^{-1} \nabla f(x)$
  - is the special case of the steepest descent method in which $P = \nabla^2 f(x)$;
  - is the steepest direction with respect to the second-order approximation.
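The generic descent loop above, using the steepest-descent direction ($P = I$) with a backtracking line search, can be sketched as follows. The test function and the backtracking parameters `alpha` and `beta` are illustrative choices, not taken from the source:

```python
import numpy as np

def backtracking_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    """General descent method: steepest-descent direction + backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stopping criterion
            break
        dx = -g                               # descent direction (P = I)
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx                        # update
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2, whose minimizer is (1, -2).
f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
x_star = backtracking_descent(f, grad, np.zeros(2))
```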
References: EE230 Convex Optimization Lecture Slide Topic 6: Unconstrained Optimization; Convex Optimization notes provided by Dr. S. Boyd.

## Newton's Method

In numerical analysis, Newton's method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. In each step, the update equation is
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
.
Limitations: it may fail when there are inflection points, local maxima, or local minima near the initial guess
$x_0$
or the root. For example, suppose you need to find the root of
$27x^3 - 3x + 1 = 0$
which is near
$x = 0$
. Then Newton's method may bounce the iterates back and forth, or carry them farther away from the root.
Figure: plot of the function $f(x) = 27x^3 - 3x + 1$.
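A minimal sketch of the root-finding iteration; the helper name `newton_root` and the tolerance values are my own choices:

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Root finding via the update x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Well-behaved case: the positive root of x^2 - 2, starting from x0 = 1.
r = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)

# Problematic case from the text: 27x^3 - 3x + 1 started at x0 = 0.
# The first iterates are 0, 1/3, 1/6, 1, ... — bouncing around the region where
# f'(x) = 81x^2 - 3 changes sign, far from the real root near x ≈ -0.44.
wanders = newton_root(lambda x: 27 * x**3 - 3 * x + 1,
                      lambda x: 81 * x * x - 3, 0.0)
```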
Application in Inverse Kinematics:
• Find
$\theta_d$
such that
$x_d - f(\theta_d) = 0$
where
$x_d$
is the desired point in task space that we would like to reach, and
$f(\theta)$
is the forward kinematics model of the manipulator.
• In each iteration,
$\theta^{i+1} = \theta^i + J^{-1}(\theta^i)(x_d - f(\theta^i))$
where
$J^{-1}(\theta^i)$
is the inverse of the Jacobian matrix. For non-square Jacobian matrices, use pseudoinverse instead.
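As an illustration of the IK update, here is a sketch for a planar 2-link arm. The link lengths, target point, and helper names are assumptions made for the example; `np.linalg.pinv` supplies the pseudoinverse mentioned above:

```python
import numpy as np

L1, L2 = 1.0, 1.0  # link lengths, chosen arbitrarily for the example

def fk(theta):
    """Forward kinematics f(theta): joint angles -> end-effector position (x, y)."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    """Jacobian J(theta) = d f / d theta."""
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

def ik_newton(x_d, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta + J^+(theta) (x_d - f(theta)), with J^+ the pseudoinverse."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        err = x_d - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

theta_d = ik_newton(np.array([1.2, 0.8]), np.array([0.0, 1.2]))
```

Using the pseudoinverse rather than the plain inverse makes the same code work when the Jacobian is non-square, as noted above.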

## Newton's Method in Optimization

In optimization, Newton's method is applied to the derivative
$f'$
of a twice-differentiable function
$f$
to find the roots of the derivative (solutions to
$f'(x) = 0$
), also known as the stationary points of
$f$
. These solutions may be minima, maxima, or saddle points.
Putting everything together, Newton's method performs the iteration
$x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)} = x_{k}-[f''(x_{k})]^{-1}f'(x_{k}) = x_k - [\nabla ^{2}f(x_k)]^{-1}\nabla f(x_k),$
where
$\nabla f(x)$
is the gradient and
$\nabla ^{2}f(x) = H_f(x)$
is the Hessian matrix.
Often Newton's method is modified to include a small step size $0<\gamma \leq 1$ instead of $\gamma=1$:
$x_{k+1}=x_{k}-\gamma [f''(x_{k})]^{-1}f'(x_{k})$
. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
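A minimal sketch of the damped iteration in one dimension; the test function and the value of `gamma` are illustrative choices:

```python
def damped_newton(fprime, fsecond, x0, gamma=0.5, tol=1e-10, max_iter=200):
    """Damped Newton iteration x_{k+1} = x_k - gamma * f'(x_k) / f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) < tol:      # stationary point found
            break
        x = x - gamma * fprime(x) / fsecond(x)
    return x

# Minimize f(x) = x^4 - 3x^2 + x; stationary points solve f'(x) = 4x^3 - 6x + 1 = 0.
xmin = damped_newton(lambda x: 4 * x**3 - 6 * x + 1,
                     lambda x: 12 * x * x - 6, x0=2.0)
```

Since the method only drives $f'$ to zero, one should check $f''(x) > 0$ at the result to confirm a local minimum rather than a maximum or saddle point.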

## Gauss-Newton Algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function.
Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.
Starting with an initial guess
$\boldsymbol{\beta}^{(0)}$
for the minimum, the method proceeds by the iterations
$\boldsymbol{\beta}^{(k+1)}= \boldsymbol{\beta}^{(k)}-\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {r} \left(\boldsymbol{\beta}^{(k)}\right)$
Note that
$\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}$
is the left pseudoinverse of
$\mathbf {J_{r}}$
, and the assumption $m \ge n$ (at least as many residuals $m$ as parameters $n$) in the algorithm statement is necessary, as otherwise the matrix
$\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}}$
is not invertible and the normal equations cannot be solved uniquely.
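The iteration can be sketched as follows. Solving the linear least-squares subproblem with `lstsq` computes the same step as $(\mathbf{J_r}^{\mathsf{T}}\mathbf{J_r})^{-1}\mathbf{J_r}^{\mathsf{T}}\mathbf{r}$ but avoids explicitly forming the normal equations; the exponential model and synthetic data are invented for the example:

```python
import numpy as np

def gauss_newton(residual, jac, beta0, tol=1e-12, max_iter=100):
    """Gauss-Newton: beta^{k+1} = beta^k - (J_r^T J_r)^{-1} J_r^T r(beta^k)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = residual(beta)
        J = jac(beta)
        # lstsq solves min ||J step - r||, i.e. step = (J^T J)^{-1} J^T r.
        step = np.linalg.lstsq(J, r, rcond=None)[0]
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Fit the model m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y          # m = 20 residuals, n = 2
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = gauss_newton(residual, jac, np.array([1.0, 0.0]))
```

Only first derivatives of the model appear in `jac`, which is exactly the advantage over full Newton noted above.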

## Levenberg–Marquardt Algorithm

The Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent.
With the Gauss–Newton method, the sum of squares of the residuals
$S$
may not decrease at every iteration. However, since
$\Delta$
is a descent direction, unless
$S (\boldsymbol {\beta }^{k})$
is a stationary point, it holds that
$S (\boldsymbol{\beta}^{k}+\alpha \Delta ) < S (\boldsymbol{\beta}^{k})$
for all sufficiently small
$\alpha >0$
. Thus, if divergence occurs, one solution is to employ a fraction
$\alpha$
of the increment vector
$\Delta$
in the updating formula:
${\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}+\alpha \Delta$
. In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function
$S$
. An optimal value for
$\alpha$
can be found by using a line search algorithm.
In cases where the direction of the shift vector is such that the optimal fraction
$\alpha$
is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,
$\left(\mathbf {J^{\mathsf {T}}J+\lambda D} \right)\Delta =-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
where
$D$
is a positive diagonal matrix and
$\lambda$
is a parameter that controls the trust-region size. Geometrically, this adds a paraboloid centered at
$\Delta = 0$
to the quadratic form, resulting in a smaller step. Note that when
$D$
is the identity matrix and
$\lambda \to +\infty$
, then
$\lambda \Delta \to -\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
, therefore the direction of
$\Delta$
approaches the direction of the negative gradient
$-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
.
Conclusions:
- The main difference between the GN and LM methods is the choice of the matrix $H$: one uses $H=\mathbf{J}^{\mathsf{T}} \mathbf{J}$ and the other $H=\mathbf{J}^{\mathsf{T}} \mathbf{J} +\lambda D$.
- When the function is well approximated by its second-order expansion, we use the Gauss-Newton method.
- When the approximation is poor, so that the matrix $\mathbf{J}^{\mathsf{T}} \mathbf{J}$ may be singular or ill-conditioned, we use the Levenberg–Marquardt method.
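A minimal sketch of the LM loop with a simple accept/reject schedule for $\lambda$. The halving/doubling factors, the scaling choice $D = \operatorname{diag}(\mathbf{J}^{\mathsf{T}}\mathbf{J})$, and the exponential test problem are illustrative assumptions, not prescribed by the source:

```python
import numpy as np

def levenberg_marquardt(residual, jac, beta0, lam=1e-2, max_iter=200, tol=1e-12):
    """Solve (J^T J + lam * D) delta = -J^T r with D = diag(J^T J), adapting lam."""
    beta = np.asarray(beta0, dtype=float)
    cost = lambda b: 0.5 * np.sum(residual(b) ** 2)
    for _ in range(max_iter):
        J = jac(beta)
        g = J.T @ residual(beta)
        A = J.T @ J
        D = np.diag(np.diag(A))                 # Marquardt's diagonal scaling
        delta = np.linalg.solve(A + lam * D, -g)
        if cost(beta + delta) < cost(beta):     # step decreases S: accept it
            beta = beta + delta
            lam *= 0.5                          # trust the model more (toward GN)
            if np.linalg.norm(delta) < tol:
                break
        else:                                   # step fails: shrink the trust region
            lam *= 2.0                          # (toward small gradient-like steps)
    return beta

# Illustrative problem: fit m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = levenberg_marquardt(residual, jac, np.array([1.0, 0.0]))
```

Shrinking $\lambda$ on success and growing it on failure realizes the interpolation described above: small $\lambda$ recovers Gauss-Newton steps, large $\lambda$ yields short steps close to the negative-gradient direction.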