
# Descent Methods

## Descent Methods

Consider the unconstrained minimization problem. The general iteration formula in Descent Methods is $x_{k+1} = x_k + t_k \Delta x_k$, where $t_k$ is the step size and $\Delta x_k$ is the step direction at iteration $k$. The outline of a general Descent Method is as follows.

given a starting point $x$.
repeat
1. Determine a descent direction $\Delta x$.
2. Line search. Choose a step size $t > 0$.
3. Update. $x := x + t \Delta x$.

until a stopping criterion is satisfied.
Different algorithms differ with respect to the choice of:

- Descent direction $\Delta x_k$:
  - Steepest Descent Method
  - Newton Method
- Step size $t_k$:
  - Fixed Step Size
  - Exact Line Search
  - Backtracking Line Search

For $\Delta x$ to be a descent direction, a necessary condition is $\nabla f(x)^T \Delta x < 0$.
- Steepest Descent Method: $\Delta x = - P^{-1} \nabla f(x)$
  - Choosing a different norm (a different symmetric positive definite matrix $P$) gives a different descent direction.
- Gradient Descent: $\Delta x = - \nabla f(x)$
  - is the special case of the steepest descent method in which $P$ is the identity matrix;
  - is the steepest direction with respect to the first-order approximation.
- Newton Method: $\Delta x = - \nabla^2 f(x)^{-1} \nabla f(x)$
  - is the special case of the steepest descent method in which $P = \nabla^2 f(x)$;
  - is the steepest direction with respect to the second-order approximation.
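The generic descent loop above, using the steepest-descent direction ($P = I$) with a backtracking line search, can be sketched as follows. The test function and the backtracking parameters `alpha` and `beta` are illustrative choices, not taken from the source:

```python
import numpy as np

def backtracking_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    """General descent method: steepest-descent direction + backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stopping criterion
            break
        dx = -g                               # descent direction (P = I)
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx                        # update
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2, whose minimizer is (1, -2).
f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
x_star = backtracking_descent(f, grad, np.zeros(2))
```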
References: EE230 Convex Optimization Lecture Slide Topic 6: Unconstrained Optimization; Convex Optimization notes provided by Dr. S. Boyd.

## Newton's Method

In numerical analysis, Newton's method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. In each step, the update equation is
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
.
Limitations: it may fail when there are inflection points, local maxima, or local minima near the initial guess
$x_0$
or the root. For example, suppose you need to find the root of
$27x^3 - 3x + 1 = 0$
which is near
$x = 0$
. Then Newton's method may bounce the iterates back and forth, or carry them farther away from the root.
Figure: plot of the function $f(x) = 27x^3 - 3x + 1$.
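A minimal sketch of the root-finding iteration; the helper name `newton_root` and the tolerance values are my own choices:

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Root finding via the update x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Well-behaved case: the positive root of x^2 - 2, starting from x0 = 1.
r = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)

# Problematic case from the text: 27x^3 - 3x + 1 started at x0 = 0.
# The first iterates are 0, 1/3, 1/6, 1, ... — bouncing around the region where
# f'(x) = 81x^2 - 3 changes sign, far from the real root near x ≈ -0.44.
wanders = newton_root(lambda x: 27 * x**3 - 3 * x + 1,
                      lambda x: 81 * x * x - 3, 0.0)
```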
Application in Inverse Kinematics:
• Find
$\theta_d$
such that
$x_d - f(\theta_d) = 0$
where
$x_d$
is the desired point in task space that we would like to reach, and
$f(\theta)$
is the forward kinematics model of the manipulator.
• In each iteration,
$\theta^{i+1} = \theta^i + J^{-1}(\theta^i)(x_d - f(\theta^i))$
where
$J^{-1}(\theta^i)$
is the inverse of the Jacobian matrix. For non-square Jacobian matrices, use pseudoinverse instead.
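As an illustration of the IK update, here is a sketch for a planar 2-link arm. The link lengths, target point, and helper names are assumptions made for the example; `np.linalg.pinv` supplies the pseudoinverse mentioned above:

```python
import numpy as np

L1, L2 = 1.0, 1.0  # link lengths, chosen arbitrarily for the example

def fk(theta):
    """Forward kinematics f(theta): joint angles -> end-effector position (x, y)."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    """Jacobian J(theta) = d f / d theta."""
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

def ik_newton(x_d, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta + J^+(theta) (x_d - f(theta)), with J^+ the pseudoinverse."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        err = x_d - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

theta_d = ik_newton(np.array([1.2, 0.8]), np.array([0.0, 1.2]))
```

Using the pseudoinverse rather than the plain inverse makes the same code work when the Jacobian is non-square, as noted above.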

## Newton's Method in Optimization

In optimization, Newton's method is applied to the derivative
$f'$
of a twice-differentiable function
$f$
to find the roots of the derivative (solutions to
$f'(x) = 0$
), also known as the stationary points of
$f$
. These solutions may be minima, maxima, or saddle points.
Putting everything together, Newton's method performs the iteration
$x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)} = x_{k}-[f''(x_{k})]^{-1}f'(x_{k}) = x_k - [\nabla ^{2}f(x_k)]^{-1}\nabla f(x_k),$
where
$\nabla f(x)$
is the gradient and
$\nabla ^{2}f(x) = H_f(x)$
is the Hessian matrix.
Often Newton's method is modified to include a small step size $0<\gamma \leq 1$ instead of $\gamma=1$:
$x_{k+1}=x_{k}-\gamma [f''(x_{k})]^{-1}f'(x_{k})$
. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
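A minimal sketch of the damped iteration in one dimension; the test function and the value of `gamma` are illustrative choices:

```python
def damped_newton(fprime, fsecond, x0, gamma=0.5, tol=1e-10, max_iter=200):
    """Damped Newton iteration x_{k+1} = x_k - gamma * f'(x_k) / f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) < tol:      # stationary point found
            break
        x = x - gamma * fprime(x) / fsecond(x)
    return x

# Minimize f(x) = x^4 - 3x^2 + x; stationary points solve f'(x) = 4x^3 - 6x + 1 = 0.
xmin = damped_newton(lambda x: 4 * x**3 - 6 * x + 1,
                     lambda x: 12 * x * x - 6, x0=2.0)
```

Since the method only drives $f'$ to zero, one should check $f''(x) > 0$ at the result to confirm a local minimum rather than a maximum or saddle point.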

## Gauss-Newton Algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function.
Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.
Starting with an initial guess
$\boldsymbol{\beta}^{(0)}$
for the minimum, the method proceeds by the iterations
$\boldsymbol{\beta}^{(k+1)}= \boldsymbol{\beta}^{(k)}-\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {r} \left(\boldsymbol{\beta}^{(k)}\right)$
Note that
$\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}$
is the left pseudoinverse of
$\mathbf {J_{r}}$
, and the assumption $m \ge n$ (at least as many residuals $m$ as parameters $n$) in the algorithm statement is necessary, as otherwise the matrix
$\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}}$
is not invertible and the normal equations cannot be solved uniquely.
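The iteration can be sketched as follows. Solving the linear least-squares subproblem with `lstsq` computes the same step as $(\mathbf{J_r}^{\mathsf{T}}\mathbf{J_r})^{-1}\mathbf{J_r}^{\mathsf{T}}\mathbf{r}$ but avoids explicitly forming the normal equations; the exponential model and synthetic data are invented for the example:

```python
import numpy as np

def gauss_newton(residual, jac, beta0, tol=1e-12, max_iter=100):
    """Gauss-Newton: beta^{k+1} = beta^k - (J_r^T J_r)^{-1} J_r^T r(beta^k)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = residual(beta)
        J = jac(beta)
        # lstsq solves min ||J step - r||, i.e. step = (J^T J)^{-1} J^T r.
        step = np.linalg.lstsq(J, r, rcond=None)[0]
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Fit the model m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y          # m = 20 residuals, n = 2
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = gauss_newton(residual, jac, np.array([1.0, 0.0]))
```

Only first derivatives of the model appear in `jac`, which is exactly the advantage over full Newton noted above.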

## Levenberg–Marquardt Algorithm

The Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent.
With the Gauss–Newton method, the sum of squares of the residuals
$S$
may not decrease at every iteration. However, since
$\Delta$
is a descent direction, unless
$S (\boldsymbol {\beta }^{k})$
is a stationary point, it holds that
$S (\boldsymbol{\beta}^{k}+\alpha \Delta ) < S (\boldsymbol{\beta}^{k})$
for all sufficiently small
$\alpha >0$
. Thus, if divergence occurs, one solution is to employ a fraction
$\alpha$
of the increment vector
$\Delta$
in the updating formula:
${\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}+\alpha \Delta$
. In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function
$S$
. An optimal value for
$\alpha$
can be found by using a line search algorithm.
In cases where the direction of the shift vector is such that the optimal fraction
$\alpha$
is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,
$\left(\mathbf {J^{\mathsf {T}}J+\lambda D} \right)\Delta =-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
where
$D$
is a positive diagonal matrix and
$\lambda$
is a parameter that controls the trust-region size. Geometrically, this adds a paraboloid centered at
$\Delta = 0$
to the quadratic form, resulting in a smaller step. Note that when
$D$
is the identity matrix and
$\lambda \to +\infty$
, then
$\lambda \Delta \to -\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
, therefore the direction of
$\Delta$
approaches the direction of the negative gradient
$-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
.
Conclusions:
- The main difference between the GN and LM methods is the choice of the matrix $H$: one uses $H=\mathbf{J}^{\mathsf{T}} \mathbf{J}$ and the other $H=\mathbf{J}^{\mathsf{T}} \mathbf{J} +\lambda D$.
- When the function is well approximated by its second-order expansion, we use the Gauss-Newton method.
- When the approximation is poor, so that the matrix $\mathbf{J}^{\mathsf{T}} \mathbf{J}$ may be singular or ill-conditioned, we use the Levenberg–Marquardt method.
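A minimal sketch of the LM loop with a simple accept/reject schedule for $\lambda$. The halving/doubling factors, the scaling choice $D = \operatorname{diag}(\mathbf{J}^{\mathsf{T}}\mathbf{J})$, and the exponential test problem are illustrative assumptions, not prescribed by the source:

```python
import numpy as np

def levenberg_marquardt(residual, jac, beta0, lam=1e-2, max_iter=200, tol=1e-12):
    """Solve (J^T J + lam * D) delta = -J^T r with D = diag(J^T J), adapting lam."""
    beta = np.asarray(beta0, dtype=float)
    cost = lambda b: 0.5 * np.sum(residual(b) ** 2)
    for _ in range(max_iter):
        J = jac(beta)
        g = J.T @ residual(beta)
        A = J.T @ J
        D = np.diag(np.diag(A))                 # Marquardt's diagonal scaling
        delta = np.linalg.solve(A + lam * D, -g)
        if cost(beta + delta) < cost(beta):     # step decreases S: accept it
            beta = beta + delta
            lam *= 0.5                          # trust the model more (toward GN)
            if np.linalg.norm(delta) < tol:
                break
        else:                                   # step fails: shrink the trust region
            lam *= 2.0                          # (toward small gradient-like steps)
    return beta

# Illustrative problem: fit m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = levenberg_marquardt(residual, jac, np.array([1.0, 0.0]))
```

Shrinking $\lambda$ on success and growing it on failure realizes the interpolation described above: small $\lambda$ recovers Gauss-Newton steps, large $\lambda$ yields short steps close to the negative-gradient direction.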