
# Descent Methods

## Descent Methods

Consider the unconstrained minimization problem. The general iteration formula in Descent Methods is $x_{k+1} = x_k + t_k \Delta x_k$, where $t_k$ is the step size and $\Delta x_k$ is the step direction at iteration $k$. The outline of a general Descent Method is as follows.

given a starting point $x$.
repeat
1. Determine a descent direction $\Delta x$.
2. Line search. Choose a step size $t > 0$.
3. Update. $x := x + t \Delta x$.

until a stopping criterion is satisfied.
Different algorithms differ with respect to the choice of:

- Descent direction $\Delta x_k$:
  - Steepest Descent Method
  - Newton Method
- Step size $t_k$:
  - Fixed Step Size
  - Exact Line Search
  - Backtracking Line Search

For $\Delta x$ to be a descent direction, a necessary condition is $\nabla f(x)^T \Delta x < 0$.
- Steepest Descent Method: $\Delta x = - P^{-1} \nabla f(x)$
  - Choosing a different norm (a different symmetric positive definite matrix $P$) gives a different descent direction.
- Gradient Descent: $\Delta x = - \nabla f(x)$
  - is the special case of the steepest descent method in which $P$ is the identity matrix;
  - is the steepest direction with respect to the first-order approximation.
- Newton Method: $\Delta x = - \nabla^2 f(x)^{-1} \nabla f(x)$
  - is the special case of the steepest descent method in which $P = \nabla^2 f(x)$;
  - is the steepest direction with respect to the second-order approximation.
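The generic descent loop above, using the steepest-descent direction ($P = I$) with a backtracking line search, can be sketched as follows. The test function and the backtracking parameters `alpha` and `beta` are illustrative choices, not taken from the source:

```python
import numpy as np

def backtracking_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    """General descent method: steepest-descent direction + backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stopping criterion
            break
        dx = -g                               # descent direction (P = I)
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx                        # update
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2, whose minimizer is (1, -2).
f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
x_star = backtracking_descent(f, grad, np.zeros(2))
```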
References: EE230 Convex Optimization Lecture Slide Topic 6: Unconstrained Optimization; Convex Optimization notes provided by Dr. S. Boyd.

## Newton's Method

In numerical analysis, Newton's method is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. In each step, the update equation is
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
.
Limitations: it may fail when there are inflection points, local maxima, or local minima near the initial guess
$x_0$
or the root. For example, suppose you need to find the root of
$27x^3 - 3x + 1 = 0$
which is near
$x = 0$
. Then Newton's method may bounce the iterates back and forth, or carry them farther away from the root.
Figure: plot of the function $f(x) = 27x^3 - 3x + 1$.
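A minimal sketch of the root-finding iteration; the helper name `newton_root` and the tolerance values are my own choices:

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Root finding via the update x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Well-behaved case: the positive root of x^2 - 2, starting from x0 = 1.
r = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)

# Problematic case from the text: 27x^3 - 3x + 1 started at x0 = 0.
# The first iterates are 0, 1/3, 1/6, 1, ... — bouncing around the region where
# f'(x) = 81x^2 - 3 changes sign, far from the real root near x ≈ -0.44.
wanders = newton_root(lambda x: 27 * x**3 - 3 * x + 1,
                      lambda x: 81 * x * x - 3, 0.0)
```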
Application in Inverse Kinematics:
• Find
$\theta_d$
such that
$x_d - f(\theta_d) = 0$
where
$x_d$
is the desired point in task space that we would like to reach, and
$f(\theta)$
is the forward kinematics model of the manipulator.
• In each iteration,
$\theta^{i+1} = \theta^i + J^{-1}(\theta^i)(x_d - f(\theta^i))$
where
$J^{-1}(\theta^i)$
is the inverse of the Jacobian matrix. For non-square Jacobian matrices, use pseudoinverse instead.
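As an illustration of the IK update, here is a sketch for a planar 2-link arm. The link lengths, target point, and helper names are assumptions made for the example; `np.linalg.pinv` supplies the pseudoinverse mentioned above:

```python
import numpy as np

L1, L2 = 1.0, 1.0  # link lengths, chosen arbitrarily for the example

def fk(theta):
    """Forward kinematics f(theta): joint angles -> end-effector position (x, y)."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    """Jacobian J(theta) = d f / d theta."""
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

def ik_newton(x_d, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta + J^+(theta) (x_d - f(theta)), with J^+ the pseudoinverse."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        err = x_d - fk(theta)
        if np.linalg.norm(err) < tol:
            break
        theta = theta + np.linalg.pinv(jacobian(theta)) @ err
    return theta

theta_d = ik_newton(np.array([1.2, 0.8]), np.array([0.0, 1.2]))
```

Using the pseudoinverse rather than the plain inverse makes the same code work when the Jacobian is non-square, as noted above.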

## Newton's Method in Optimization

In optimization, Newton's method is applied to the derivative
$f'$
of a twice-differentiable function
$f$
to find the roots of the derivative (solutions to
$f'(x) = 0$
), also known as the stationary points of
$f$
. These solutions may be minima, maxima, or saddle points.
Putting everything together, Newton's method performs the iteration
$x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)} = x_{k}-[f''(x_{k})]^{-1}f'(x_{k}) = x_k - [\nabla ^{2}f(x_k)]^{-1}\nabla f(x_k),$
where
$\nabla f(x)$
is the gradient and
$\nabla ^{2}f(x) = H_f(x)$
is the Hessian matrix.
Often Newton's method is modified to include a small step size $0<\gamma \leq 1$ instead of $\gamma=1$:
$x_{k+1}=x_{k}-\gamma [f''(x_{k})]^{-1}f'(x_{k})$
. For step sizes other than 1, the method is often referred to as the relaxed or damped Newton's method.
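A minimal sketch of the damped iteration in one dimension; the test function and the value of `gamma` are illustrative choices:

```python
def damped_newton(fprime, fsecond, x0, gamma=0.5, tol=1e-10, max_iter=200):
    """Damped Newton iteration x_{k+1} = x_k - gamma * f'(x_k) / f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) < tol:      # stationary point found
            break
        x = x - gamma * fprime(x) / fsecond(x)
    return x

# Minimize f(x) = x^4 - 3x^2 + x; stationary points solve f'(x) = 4x^3 - 6x + 1 = 0.
xmin = damped_newton(lambda x: 4 * x**3 - 6 * x + 1,
                     lambda x: 12 * x * x - 6, x0=2.0)
```

Since the method only drives $f'$ to zero, one should check $f''(x) > 0$ at the result to confirm a local minimum rather than a maximum or saddle point.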

## Gauss-Newton Algorithm

The Gauss–Newton algorithm is used to solve non-linear least squares problems. It is a modification of Newton's method for finding a minimum of a function.
Unlike Newton's method, the Gauss–Newton algorithm can only be used to minimize a sum of squared function values, but it has the advantage that second derivatives, which can be challenging to compute, are not required.
Starting with an initial guess
$\boldsymbol{\beta}^{(0)}$
for the minimum, the method proceeds by the iterations
$\boldsymbol{\beta}^{(k+1)}= \boldsymbol{\beta}^{(k)}-\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {r} \left(\boldsymbol{\beta}^{(k)}\right)$
Note that
$\left(\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}} \right)^{-1}\mathbf {J_{r}} ^{\mathsf {T}}$
is the left pseudoinverse of
$\mathbf {J_{r}}$
, and the assumption $m \ge n$ (at least as many residuals $m$ as parameters $n$) in the algorithm statement is necessary, as otherwise the matrix
$\mathbf {J_{r}} ^{\mathsf {T}}\mathbf {J_{r}}$
is not invertible and the normal equations cannot be solved uniquely.
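The iteration can be sketched as follows. Solving the linear least-squares subproblem with `lstsq` computes the same step as $(\mathbf{J_r}^{\mathsf{T}}\mathbf{J_r})^{-1}\mathbf{J_r}^{\mathsf{T}}\mathbf{r}$ but avoids explicitly forming the normal equations; the exponential model and synthetic data are invented for the example:

```python
import numpy as np

def gauss_newton(residual, jac, beta0, tol=1e-12, max_iter=100):
    """Gauss-Newton: beta^{k+1} = beta^k - (J_r^T J_r)^{-1} J_r^T r(beta^k)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = residual(beta)
        J = jac(beta)
        # lstsq solves min ||J step - r||, i.e. step = (J^T J)^{-1} J^T r.
        step = np.linalg.lstsq(J, r, rcond=None)[0]
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Fit the model m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y          # m = 20 residuals, n = 2
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = gauss_newton(residual, jac, np.array([1.0, 0.0]))
```

Only first derivatives of the model appear in `jac`, which is exactly the advantage over full Newton noted above.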

## Levenberg–Marquardt Algorithm

The Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent.
With the Gauss–Newton method, the sum of squares of the residuals
$S$
may not decrease at every iteration. However, since
$\Delta$
is a descent direction, unless
$S (\boldsymbol {\beta }^{k})$
is a stationary point, it holds that
$S (\boldsymbol{\beta}^{k}+\alpha \Delta ) < S (\boldsymbol{\beta}^{k})$
for all sufficiently small
$\alpha >0$
. Thus, if divergence occurs, one solution is to employ a fraction
$\alpha$
of the increment vector
$\Delta$
in the updating formula:
${\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}+\alpha \Delta$
. In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function
$S$
. An optimal value for
$\alpha$
can be found by using a line search algorithm.
In cases where the direction of the shift vector is such that the optimal fraction
$\alpha$
is close to zero, an alternative method for handling divergence is the use of the Levenberg–Marquardt algorithm, a trust region method. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,
$\left(\mathbf {J^{\mathsf {T}}J+\lambda D} \right)\Delta =-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
where
$D$
is a positive diagonal matrix and
$\lambda$
is a parameter that controls the trust-region size. Geometrically, this adds a paraboloid centered at
$\Delta = 0$
to the quadratic form, resulting in a smaller step. Note that when
$D$
is the identity matrix and
$\lambda \to +\infty$
, then
$\lambda \Delta \to -\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
, therefore the direction of
$\Delta$
approaches the direction of the negative gradient
$-\mathbf {J} ^{\mathsf {T}}\mathbf {r}$
.
Conclusions:
- The main difference between the GN and LM methods is the choice of the matrix $H$: one uses $H=\mathbf{J}^{\mathsf{T}} \mathbf{J}$ and the other $H=\mathbf{J}^{\mathsf{T}} \mathbf{J} +\lambda D$.
- When the function is well approximated by its second-order expansion, we use the Gauss-Newton method.
- When the approximation is poor, so that the matrix $\mathbf{J}^{\mathsf{T}} \mathbf{J}$ may be singular or ill-conditioned, we use the Levenberg–Marquardt method.
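A minimal sketch of the LM loop with a simple accept/reject schedule for $\lambda$. The halving/doubling factors, the scaling choice $D = \operatorname{diag}(\mathbf{J}^{\mathsf{T}}\mathbf{J})$, and the exponential test problem are illustrative assumptions, not prescribed by the source:

```python
import numpy as np

def levenberg_marquardt(residual, jac, beta0, lam=1e-2, max_iter=200, tol=1e-12):
    """Solve (J^T J + lam * D) delta = -J^T r with D = diag(J^T J), adapting lam."""
    beta = np.asarray(beta0, dtype=float)
    cost = lambda b: 0.5 * np.sum(residual(b) ** 2)
    for _ in range(max_iter):
        J = jac(beta)
        g = J.T @ residual(beta)
        A = J.T @ J
        D = np.diag(np.diag(A))                 # Marquardt's diagonal scaling
        delta = np.linalg.solve(A + lam * D, -g)
        if cost(beta + delta) < cost(beta):     # step decreases S: accept it
            beta = beta + delta
            lam *= 0.5                          # trust the model more (toward GN)
            if np.linalg.norm(delta) < tol:
                break
        else:                                   # step fails: shrink the trust region
            lam *= 2.0                          # (toward small gradient-like steps)
    return beta

# Illustrative problem: fit m(x; b) = b0 * exp(b1 * x) to noise-free synthetic data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda b: b[0] * np.exp(b[1] * x) - y
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
beta_hat = levenberg_marquardt(residual, jac, np.array([1.0, 0.0]))
```

Shrinking $\lambda$ on success and growing it on failure realizes the interpolation described above: small $\lambda$ recovers Gauss-Newton steps, large $\lambda$ yields short steps close to the negative-gradient direction.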