October 27, 2020

Descent Optimization for Finding Local Minima

Descent methods are a family of first-order and second-order optimization techniques. At each iteration we take a step of a certain size along a descent direction, which decreases the value of the objective function, so iteration by iteration the function is optimized. The procedure follows clear steps (a minimal sketch of the loop is given right after the list):

  1. Check whether x^k satisfies the termination condition; if it does, stop. Otherwise:
  2. Determine the descent direction d^k using the gradient (first-order derivative) or the Hessian (second-order derivative) matrix.
  3. Determine the step size \alpha^k to take along the descent direction.
  4. Update to the new design point, x^{k+1}=x^k+\alpha^k d^k .
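To make the loop concrete, here is a minimal sketch in Python. The objective f, its gradient grad_f, the steepest-descent choice of direction, and the fixed step size are illustrative assumptions, not part of the description above:

```python
import numpy as np

def descent_method(f, grad_f, x0, alpha=0.1, max_iters=1000, eps_g=1e-6):
    """Generic descent loop: check termination, pick a direction, step, repeat."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        # Step 1: termination condition (here: gradient magnitude close to zero).
        if np.linalg.norm(g) < eps_g:
            break
        # Step 2: descent direction (steepest descent uses the negative gradient).
        d = -g
        # Step 3: step size (a fixed alpha here; a line search would choose it adaptively).
        # Step 4: update the design point x^{k+1} = x^k + alpha^k * d^k.
        x = x + alpha * d
    return x

# Usage: minimize f(x) = x^T x starting from (3, -4).
x_min = descent_method(lambda v: v @ v, lambda v: 2 * v, x0=[3.0, -4.0])
```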

Line Search

For the step size, we perform a line search, in which we minimize the one-dimensional function f(x+\alpha d) along the descent direction. A larger \alpha gives bigger steps and faster progress, but if the step is too large the minimum will be overshot, so smaller steps are generally safer. A common compromise is a decaying learning rate, which reduces the value of \alpha over the iterations according to a given schedule.
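As a rough sketch of an exact line search, the one-dimensional function f(x+\alpha d) can be minimized numerically. The use of SciPy's minimize_scalar and the bound alpha_max below are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search(f, x, d, alpha_max=10.0):
    """Minimize phi(alpha) = f(x + alpha * d) over [0, alpha_max]."""
    phi = lambda a: f(x + a * d)
    # Bounded one-dimensional minimization in alpha.
    res = minimize_scalar(phi, bounds=(0.0, alpha_max), method="bounded")
    return res.x

# Usage: for f(x) = x^T x with d = -grad f(x), the exact minimizer is alpha = 0.5.
x = np.array([3.0, -4.0])
alpha_star = exact_line_search(lambda v: v @ v, x, d=-2 * x)
```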

A better practical method is approximate line search, which does not require the one-dimensional minimization to be precise and is therefore computationally cheaper per step. We may have to take more steps overall, but each step is cheap, whereas an exact minimization becomes more and more expensive, and sometimes infeasible, as the function grows more complex.

Approximate line search requires two conditions to be satisfied-

1. First Wolfe Condition for optimization –

According to this condition, as you increase the step size, the resulting decrease in the function value must be sufficiently large. This condition is also called the Sufficient Decrease condition or Armijo condition.

Mathematically,

f(x^k)-f(x^k+\alpha^k d^k) >= \alpha^k \gamma , such that \gamma>0 .

Here \gamma is, for now, an arbitrary positive value whose form still has to be determined.

(Figure: first Wolfe condition for optimization)


\gamma should depend on how steep the descent is. If the descent (the magnitude of the directional derivative) is large, then larger learning rates are allowed, which results in faster convergence.

\gamma = -\beta \bigtriangledown f(x^k)^T d^k , where \beta provides flexibility in the value of \gamma .

0< \beta <1.

f(x^k+\alpha^k d^k) <= f(x^k) + \beta \alpha^k \bigtriangledown f(x^k)^T d^k

For \beta=0 ,

f(x^k+\alpha^k d^k) <= f(x^k) , which only requires that the function value does not increase, so it is not sufficient for descent.

For \beta=1 ,

f(x^k+\alpha^k d^k) <= f(x^k) + \alpha^k \bigtriangledown f(x^k)^T d^k , where the right-hand side is the equation of the tangent (the first-order approximation of f at x^k ).


Values of \beta closer to 1 bring the condition close to the tangent line, so very few step sizes are accepted and little descent is allowed; our value should therefore be closer to 0.

\beta=0.0001 is a widely agreed-on choice.
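A common way to enforce the sufficient decrease condition in practice is backtracking: start from a trial \alpha and keep shrinking it until the condition holds. The sketch below assumes a shrink factor of 0.5 and a starting \alpha of 1, which are illustrative choices rather than values from the discussion above:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, alpha=1.0, beta=1e-4, shrink=0.5):
    """Shrink alpha until f(x + alpha*d) <= f(x) + beta * alpha * grad_f(x)^T d."""
    fx = f(x)
    slope = grad_f(x) @ d          # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + beta * alpha * slope:
        alpha *= shrink            # step too long: the decrease was not sufficient
    return alpha

# Usage: steepest-descent direction for f(x) = x^T x at x = (3, -4).
x = np.array([3.0, -4.0])
alpha = backtracking_line_search(lambda v: v @ v, lambda v: 2 * v, x, d=-2 * x)
```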

2. Second Wolfe Condition for optimization –

This condition makes sure that the update step is not too small, because steps that are too small make too little progress and the method may fail to converge. It is also called the Curvature Condition.

Consider the directional derivative along d^k as we move from x^k to x^k+\alpha^k d^k :

At x^k : \bigtriangledown f(x^k)^T d^k < 0 , since d^k is a descent direction.

At x^k+\alpha^k d^k , the local minimum along the line: \bigtriangledown f(x^k+\alpha^k d^k)^T d^k = 0 .

So by the time we reach that minimum, the directional derivative has increased by |\bigtriangledown f(x^k)^T d^k| , from a negative value up to zero.

If we wanted to reach the minimum along the line in one step, this is the increase we would have to achieve, but doing that exactly (an exact line search) is computationally very demanding. That is the reason we relax it: we settle for a parameter \sigma and require that the ratio of the directional derivative after the step to the one before it be at most \sigma .

\frac{\bigtriangledown f(x^k+\alpha^k d^k)^T d^k}{\bigtriangledown f(x^k)^T d^k} <= \sigma

Since \bigtriangledown f(x^k)^T d^k < 0 , multiplying both sides by it flips the inequality, giving the usual form

\bigtriangledown f(x^k+\alpha^k d^k)^T d^k >= \sigma \bigtriangledown f(x^k)^T d^k

Our value for \sigma should be closer to 0; a widely accepted value is 0.1.

It is common to set \beta < \sigma < 1 .

The sufficient decrease condition together with the following stronger form of the curvature condition forms the strong Wolfe conditions:

|\bigtriangledown f(x^{k+1})^T d^k| <= -\sigma \bigtriangledown f(x^k)^T d^k
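Putting the two conditions together, a small helper can test whether a candidate step size is acceptable. This is a sketch for illustration; the helper name and the example function are assumptions, while the default \beta and \sigma follow the values suggested above:

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, beta=1e-4, sigma=0.1, strong=True):
    """Check the sufficient decrease and curvature conditions for a candidate step alpha."""
    slope0 = grad_f(x) @ d                  # directional derivative at x^k (negative)
    x_new = x + alpha * d
    # First Wolfe condition: sufficient decrease (Armijo).
    sufficient = f(x_new) <= f(x) + beta * alpha * slope0
    slope1 = grad_f(x_new) @ d              # directional derivative at x^k + alpha^k d^k
    if strong:
        # Strong curvature condition: |grad f(x^{k+1})^T d^k| <= -sigma * grad f(x^k)^T d^k
        curvature = abs(slope1) <= -sigma * slope0
    else:
        # Ordinary curvature condition: grad f(x^{k+1})^T d^k >= sigma * grad f(x^k)^T d^k
        curvature = slope1 >= sigma * slope0
    return sufficient and curvature

# Usage: for f(x) = x^T x with d = -grad f(x), the exact minimizer alpha = 0.5 passes both checks.
x = np.array([3.0, -4.0])
ok = satisfies_wolfe(lambda v: v @ v, lambda v: 2 * v, x, d=-2 * x, alpha=0.5)
```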

Trust Region Method

Trust region methods constrain the next step to lie within a local region around the current design point, for example a circular (or spherical) region of a given radius. The candidate step is judged by comparing the actual improvement in f with the improvement predicted by a local model \hat{f} . If the ratio of actual to predicted improvement is close to 1, the model is trusted and the trust region is expanded; if it is close to 0, the trust region is shrunk. A larger trust region allows a larger step, after which the descent direction is recomputed.

(Figure: trust region method)

\eta = \frac{\text{actual improvement}}{\text{predicted improvement}} = \frac{f(x)-f(x')}{f(x)-\hat{f}(x')}
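The radius-update logic can be sketched as follows. The first-order model \hat{f} , the steepest-descent candidate step, and the thresholds 0.25 / 0.75 with halving and doubling of the radius are illustrative assumptions; practical trust-region methods usually minimize a quadratic model inside the region:

```python
import numpy as np

def trust_region_step(f, grad_f, x, delta, eta_low=0.25, eta_high=0.75):
    """One trust-region iteration with a linear model and a steepest-descent candidate step."""
    g = grad_f(x)
    g_norm = np.linalg.norm(g)
    if g_norm < 1e-12:                    # already (numerically) at a stationary point
        return x, delta
    # Candidate point on the boundary of the trust region, along the steepest-descent direction.
    x_new = x - delta * g / g_norm
    # Predicted value from the first-order model f_hat(x') = f(x) + grad f(x)^T (x' - x).
    f_hat = f(x) + g @ (x_new - x)
    # Ratio of actual to predicted improvement.
    eta = (f(x) - f(x_new)) / (f(x) - f_hat)
    if eta < eta_low:
        return x, 0.5 * delta             # poor agreement: reject the step, shrink the region
    if eta > eta_high:
        return x_new, 2.0 * delta         # good agreement: accept the step, expand the region
    return x_new, delta                   # acceptable agreement: accept, keep the radius

# Usage: a few iterations on f(x) = x^T x.
x, delta = np.array([3.0, -4.0]), 1.0
for _ in range(5):
    x, delta = trust_region_step(lambda v: v @ v, lambda v: 2 * v, x, delta)
```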

Termination Conditions for Descent –

  1. Maximum Iterations – a fixed limit on the number of iterations is reached.
  2. Absolute Improvement – f(x^k) - f(x^{k+1}) < \epsilon_a
  3. Relative Improvement – f(x^k) - f(x^{k+1}) < \epsilon_r |f(x^k)|
  4. Gradient Magnitude – the gradient becomes very close to 0, i.e., \| \bigtriangledown f(x^{k+1}) \| < \epsilon_g (these checks are combined in the sketch after this list).
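A sketch that bundles these four checks into one function; the tolerance names \epsilon_a , \epsilon_r , \epsilon_g and their default values are illustrative assumptions:

```python
import numpy as np

def should_terminate(k, f_prev, f_curr, grad_curr,
                     max_iters=1000, eps_a=1e-8, eps_r=1e-8, eps_g=1e-6):
    """Return True if any of the four termination conditions is met."""
    if k >= max_iters:                               # 1. maximum iterations
        return True
    if f_prev - f_curr < eps_a:                      # 2. absolute improvement too small
        return True
    if f_prev - f_curr < eps_r * abs(f_prev):        # 3. relative improvement too small
        return True
    if np.linalg.norm(grad_curr) < eps_g:            # 4. gradient magnitude close to zero
        return True
    return False

# Usage: check after one descent update.
done = should_terminate(k=1, f_prev=25.0, f_curr=16.0, grad_curr=np.array([4.8, -6.4]))
```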
