Machine Learning Optimization Methods “Mechanics, Pros, And Cons”

Salmen Zouari
23 min read · Jan 14, 2021


The principal goal of machine learning is to create a model that performs well and gives accurate predictions in a particular set of cases. In order to achieve that, we need machine learning optimization.

Machine learning optimization is the process of iteratively adjusting a model's parameters (and, at a higher level, its hyperparameters) in order to minimize the cost function, using one of the optimization techniques. Minimizing the cost function matters because it measures the discrepancy between the true value of the target and what the model has predicted.

In this article, we will discuss the main types of ML optimization techniques and see the advantages and the disadvantages of each technique.

1. Feature Scaling

Feature Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data preprocessing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to treat features with larger values as more important and features with smaller values as less important, regardless of the unit of those values.

Techniques to perform Feature Scaling
Consider the two most important ones:

  • Min-Max Normalization: This technique re-scales a feature or observation value to a distribution between 0 and 1.
Attaining the global minimum before and after scaling
  • Standardization: A very effective technique which re-scales a feature value so that it has a distribution with mean 0 and variance equal to 1.
Before and after using Standardization
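As a rough sketch of these two rescalings in plain NumPy (the toy values below are purely illustrative):

    import numpy as np

    # toy feature matrix: one column in metres, one in kilograms (illustrative values)
    X = np.array([[3000.0, 70.0],
                  [5000.0, 60.0],
                  [1000.0, 80.0]])

    # Min-Max normalization: rescale each feature to the [0, 1] range
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Standardization: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_minmax)
    print(X_std)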

Advantages

Example: if an algorithm does not use feature scaling, it can consider a value of 3,000 metres to be greater than 5 km, which is not actually true, and in that case the algorithm will give wrong predictions. So we use feature scaling to bring all values to the same magnitude and thus tackle this issue.

Disadvantages

Feature scaling usually helps, but it is not guaranteed to improve performance. With distance-based methods such as SVM, omitting scaling results in models that are disproportionately influenced by the subset of features on a large scale. It may well be the case that those features are in fact the best ones you have; in that case, scaling will reduce performance.

2. Batch normalization

Normalisation is a technique to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. This technique is generally applied as part of data preparation for machine learning and is necessary if various input features are in a different range of values.

Batch normalization extends this idea from the input features to the intermediate activations of a network: each layer's inputs are normalized using the mean and variance computed over the current mini-batch, and then scaled and shifted by learnable parameters. Using batch normalization during inference can be a bit tricky, because we might not always have a batch at inference time. For example, consider running an object detector on a video in real time: a single frame is processed at a time, so there is no batch.

This is crucial because we need the batch mean μ and variance σ² to produce the output of the batch norm layer. In that case, we keep a moving average of the mean and variance during training, and then plug these values in for the mean and the variance during inference. This is the approach taken by most deep learning libraries that ship batch norm layers out of the box.

The justification for using a moving average rests on the law of large numbers. The mean and variance of a mini-batch are very noisy estimates of the true mean and variance. While the batch estimates are called the batch statistics, the true (unknown to us) values of the mean and variance are called the population statistics. The law of large numbers states that, for a large number of samples, the batch statistics tend to converge to the population statistics, which is why we use a moving average during training. It also helps even out the noise in the estimates caused by the mini-batch nature of our optimization algorithm.

If we do have the option of using batches at test time, we use the same equations as above, with one minor change in how we calculate the variance (and hence the standard deviation). Instead of the biased estimate

σ² = (1/m) Σᵢ (xᵢ − μ)²

we use the unbiased estimate

σ² = (1/(m−1)) Σᵢ (xᵢ − μ)²

The reason we use m−1 in the denominator instead of m is that, since we have already estimated the mean, we only have m−1 independent entities in our mini-batch. Had that not been the case, the mean could have been arbitrarily any number, but here we have a fixed mean which we are using to compute the variance.
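A minimal NumPy sketch of this idea, assuming a single batch-norm layer with learnable scale gamma and shift beta (the momentum value used for the moving averages is illustrative, not prescriptive):

    import numpy as np

    def batch_norm(x, gamma, beta, running_mean, running_var,
                   training=True, momentum=0.9, eps=1e-5):
        """Batch norm over the batch axis, keeping moving averages for inference."""
        if training:
            mean = x.mean(axis=0)
            var = x.var(axis=0)                      # batch statistics during training
            # update the moving averages used later at inference time
            running_mean = momentum * running_mean + (1 - momentum) * mean
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mean, var = running_mean, running_var    # population estimates at test time
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta, running_mean, running_var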

Advantages

  1. Networks train faster — Each training iteration will actually be slower because of the extra calculations during the forward pass and the additional parameters to train during back-propagation. However, the network should converge much more quickly, so training should be faster overall.
  2. Allows higher learning rates — Gradient descent usually requires small learning rates for the network to converge. And as networks get deeper, their gradients get smaller during back propagation so they require even more iterations. Using batch normalization allows us to use much higher learning rates, which further increases the speed at which networks train.
  3. Makes weights easier to initialize — Weight initialization can be difficult, and it’s even more difficult when creating deeper networks. Batch normalization seems to allow us to be much less careful about choosing our initial starting weights.
  4. Makes more activation functions viable — Some activation functions do not work well in some situations. Sigmoids lose their gradient pretty quickly, which means they can’t be used in deep networks. And ReLUs often die out during training, where they stop learning completely, so we need to be careful about the range of values fed into them. Because batch normalization regulates the values going into each activation function, non-linearities that don’t seem to work well in deep networks actually become viable again.
  5. Simplifies the creation of deeper networks — Because of the first 4 items listed above, it is easier to build and faster to train deeper neural networks when using batch normalization. And it’s been shown that deeper networks generally produce better results, so that’s great.
  6. Provides a bit of regularization — Batch normalization adds a little noise to your network. In some cases, such as in Inception modules, batch normalization has been shown to work as well as dropout. In general, consider batch normalization a bit of extra regularization, possibly allowing you to reduce some of the dropout you might otherwise add to a network.
  7. May give better results overall — Some tests seem to show batch normalization actually improves the training results. However, it’s really an optimization to help train faster, so you shouldn’t think of it as a way to make your network better. But since it lets you train networks faster, that means you can iterate over more designs more quickly. It also lets you build deeper networks, which are usually better. So when you factor in everything, you’re probably going to end up with better results if you build your networks with batch normalization.

Disadvantages

  1. Difficult to estimate mean and standard deviation of input during testing
  2. Cannot use batch size of 1 during training
  3. Computational overhead during training
  4. Not good for online learning
  5. Different calculation between train and test
  6. Not good for Recurrent Neural Networks and Long Short-Term Memory Networks

3. Mini-batch gradient descent

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate the model error and update the model coefficients.

Let us understand it like this: suppose I have 1,000 records and my batch size = 50. I randomly choose 50 records, calculate the summed loss over them, and then send that loss to the optimizer to find dE/dw.

Note: batches are formed in terms of random selection of datasets.

We use a batch of a fixed number of training examples which is less than the actual dataset and call it a mini-batch. Doing this helps us achieve the advantages of both the former variants we saw. So, after creating the mini-batches of fixed size, we do the following steps in one epoch:

  1. Pick a mini-batch
  2. Feed it to Neural Network
  3. Calculate the mean gradient of the mini-batch
  4. Use the mean gradient we calculated in step 3 to update the weights
  5. Repeat steps 1–4 for the mini-batches we created

So, when we are using the mini-batch gradient descent we are updating our parameters frequently as well as we can use vectorized implementation for faster computations.
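To make the steps concrete, here is a minimal sketch of one epoch of mini-batch gradient descent on a toy linear-regression problem (the data, batch size of 50 and learning rate are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)   # 1000 toy records
    w, b = np.zeros(10), 0.0
    batch_size, lr = 50, 0.01

    indices = rng.permutation(len(X))                 # random selection of records
    for start in range(0, len(X), batch_size):        # 5. repeat steps 1-4 for each mini-batch
        batch = indices[start:start + batch_size]     # 1. pick a mini-batch
        pred = X[batch] @ w + b                       # 2. feed it to the model
        err = pred - y[batch]
        grad_w = X[batch].T @ err / len(batch)        # 3. mean gradient of the mini-batch
        grad_b = err.mean()
        w -= lr * grad_w                              # 4. use the mean gradient to update weights
        b -= lr * grad_b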

Difference between gradient descent types

Advantages:

  1. The model update frequency is higher than with batch gradient descent: in mini-batch gradient descent we do not wait for the entire dataset; we pass just 50, 100, 200, or 256 records at a time to the optimizer.
  2. Batching gives efficiency both in memory, since we do not need all training data in memory at once, and in the implementation, since we only need to store the losses for the current batch.
  3. The batch updates provide a computationally more efficient process than stochastic gradient descent.

Disadvantages:

  1. There is no guarantee of better convergence of the error.
  2. The 50 sample records we take may not represent the properties (or variance) of the entire dataset. For this reason we may never reach exact convergence, i.e., we may not hit the absolute global or local minimum at any point in time.
  3. With mini-batch gradient descent, since we take records in batches, some batches may produce one error and other batches a different error. So we have to control the learning rate ourselves whenever we use mini-batch gradient descent: if the learning rate is very low, the convergence rate will also fall; if the learning rate is too high, we won't reach an absolute global or local minimum.

Note:If the batch size = total no. of data, then in this case,

Batch gradient descent = Mini-Batch gradient descent.

What is batch size?

The batch size defines the number of samples that will be propagated through the network.

For instance, let’s say you have 1,050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (101st to 200th) and trains the network again. We can keep doing this procedure until we have propagated all samples through the network. A problem may arise with the last set of samples: in our example, 1,050 is not divisible by 100 without remainder. The simplest solution is just to take the final 50 samples and train the network on them.

Advantages of using a batch size < number of all samples:

It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That’s especially important if you are not able to fit the whole dataset in your machine’s memory.

Typically networks train faster with mini-batches. That’s because we update the weights after each propagation. In our example we’ve propagated 11 batches (10 of them had 100 samples and 1 had 50 samples) and after each of them we’ve updated our network’s parameters. If we used all samples during propagation we would make only 1 update for the network’s parameter.

Disadvantages of using a batch size < number of all samples:

The smaller the batch the less accurate the estimate of the gradient will be. In the figure below, you can see that the direction of the mini-batch gradient (green color) fluctuates much more in comparison to the direction of the full batch gradient (blue color).

Stochastic gradient descent is just mini-batch gradient descent with a batch_size equal to 1. In that case, the gradient changes its direction even more often than a mini-batch gradient.

4. Gradient descent with momentum

Gradient descent with momentum almost always works faster than standard gradient descent. The basic idea is to calculate an exponentially weighted average of your gradients and then use that average, instead of the raw gradient, to update your weights.

Stochastic gradient descent has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, stochastic gradient descent oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum, as in fig. 1.

fig1

Momentum is a method that helps accelerate stochastic gradient descent in the relevant direction and dampens oscillations, as can be seen in fig. 2. It does this by adding a fraction γ of the previous update vector to the current update vector.

fig2

Exponentially weighted averages deal with sequences of numbers. Suppose we have some sequence S which is noisy. For this example I plotted a cosine function and added some Gaussian noise. It looks like this:

Our sequence.

Note that even though these dots seem very close to each other, none of them share an x coordinate; it is a unique number for each point, the number that defines the index of each point in our sequence S.

What we want to do with this data is, instead of using it directly, compute some kind of ‘moving’ average which would ‘denoise’ the data and bring it closer to the original function. Exponentially weighted averages can give us a picture which looks like this:

Momentum: data from exponentially weighted averages.

As you can see, that’s a pretty good result. Instead of data with a lot of noise, we got a much smoother line, which is closer to the original function than the data we had. Exponentially weighted averages define a new sequence V with the following equation:

V_t = β · V_{t−1} + (1 − β) · S_t

That sequence V is the one plotted in yellow above. Beta is another hyperparameter which takes values from 0 to 1; I used beta = 0.9 above. It is a good value and the one most often used in stochastic gradient descent with momentum. Intuitively, you can think of beta as follows: we are approximately averaging over the last 1 / (1 − beta) points of the sequence. Let’s see how the choice of beta affects our new sequence V.

Exponentially weighted averages for different values of beta.

As you can see, with smaller values of beta the new sequence fluctuates a lot, because we are averaging over a smaller number of examples and are therefore ‘closer’ to the noisy data. With bigger values of beta, like beta = 0.98, we get a much smoother curve, but it is shifted a little to the right, because we average over a larger number of examples (around 50 for beta = 0.98). Beta = 0.9 provides a good balance between these two extremes.

We clearly see, that as we increase momentum, we get to the local minimum faster, but we might also overshoot and then it takes longer.
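As a small sketch, assuming the noisy cosine sequence described above, the exponentially weighted average and the corresponding momentum update could be written like this (the scale of the noise and the learning rate are illustrative):

    import numpy as np

    # noisy cosine sequence, as in the example above
    t = np.linspace(0, 4 * np.pi, 200)
    S = np.cos(t) + np.random.default_rng(0).normal(scale=0.3, size=t.shape)

    def ewma(S, beta=0.9):
        """V_t = beta * V_{t-1} + (1 - beta) * S_t"""
        V, v = [], 0.0
        for s in S:
            v = beta * v + (1 - beta) * s
            V.append(v)
        return np.array(V)

    smoothed = ewma(S, beta=0.9)   # averages over roughly the last 1 / (1 - beta) = 10 points

    # the same idea applied to gradients gives SGD with momentum:
    # v = gamma * v + lr * grad;  w = w - v
    def momentum_step(w, v, grad, lr=0.01, gamma=0.9):
        v = gamma * v + lr * grad
        return w - v, v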

Advantages

Robust and efficient: the momentum term grows in dimensions whose gradients keep pointing in the same direction, and shrinks the updates in dimensions whose gradients change direction. As a result, we get faster convergence and reduced oscillation.

Disadvantages

  1. No rule-of-thumb for selecting mini-batch size
  2. The number of epochs required can be extremely high, depending on the initialization of ‘w’ and ‘b’. Another limitation of the gradient descent update rule is that learning can stop if the parameters get stuck in a local minimum.
  3. Momentum-based GD is able to take larger steps even in regions of gentle slope, because the momentum of the past gradients is carried along at every step. But is taking a larger step always better? What about when the global minimum is about to be reached? Could momentum cause us to run past our goal? Consider the example where momentum-based GD overshoots and moves past the global minimum.

Although Momentum-based GD is faster than GD, it oscillates in and out of the minima valley. But can this oscillation be reduced? Of course, Nesterov Accelerated GD helps us to reduce the oscillations that happen in momentum based GD.

5. RMSProp optimization

RMSprop is a gradient-based optimization technique used in training neural networks, proposed by Geoffrey Hinton, one of the authors of the back-propagation paper. Gradients of very complex functions like neural networks have a tendency to either vanish or explode as the data propagates through the function. RMSprop was developed as a stochastic technique for mini-batch learning.

RMSprop deals with the above issue by using a moving average of squared gradients to normalize the gradient. This normalization balances the step size (momentum), decreasing the step for large gradients to avoid exploding, and increasing the step for small gradients to avoid vanishing.

Simply put, RMSprop uses an adaptive learning rate instead of treating the learning rate as a hyperparametre. This means that the learning rate changes over time.

RMSprop, or Root Mean Square Propagation, has an interesting history: it was devised by the legendary Geoffrey Hinton, suggested almost as an aside during a Coursera class.

RMSProp also tries to dampen the oscillations, but in a different way than momentum. RMSProp also takes away the need to adjust the learning rate manually, and does it automatically. Moreover, RMSProp chooses a different learning rate for each parameter.

In RMSProp, each update is done according to the equations below. The update is done separately for each parameter:

v_t = ρ · v_{t−1} + (1 − ρ) · g_t²            (Eq. 1)
Δw_t = −(η / √(v_t + ε)) · g_t                (Eq. 2)
w_{t+1} = w_t + Δw_t                          (Eq. 3)

So, let’s break down what is happening here.

In the first equation, we compute an exponential average of the square of the gradient. Since we do it separately for each parameter, gradient Gt here corresponds to the projection, or component of the gradient along the direction represented by the parameter we are updating.

To do that, we multiply the exponential average computed up to the last update by a hyperparameter, represented here by the Greek symbol ρ. We then multiply the square of the current gradient by (1 − ρ), and add the two together to get the exponential average up to the current time step.

The reason we use an exponential average is that, as we saw in the momentum example, it weighs the more recent gradient updates more than the less recent ones. In fact, the name “exponential” comes from the fact that the weight of previous terms falls exponentially: the most recent gradient has the largest weight, the one before it is down-weighted by a factor of ρ, the one before that by ρ², and so on.

Notice, in our diagram denoting pathological curvature, that the components of the gradients along w1 are much larger than the ones along w2. Since we square and add them, they don’t cancel out, and the exponential average ends up large for the w1 updates.

Then, in the second equation, we decide our step size. We move in the direction of the gradient, but the step size is scaled by the exponential average: we choose an initial learning rate η and divide it by the square root of the average. In our case, since the average for w1 is much larger than for w2, the learning step for w1 is much smaller than that for w2. This helps us avoid bouncing between the ridges and instead move towards the minimum.

The third equation is just the update step. The hyperparameter ρ is generally chosen to be 0.9, but you might have to tune it. The epsilon in equation 2 is there to ensure that we do not end up dividing by zero, and is generally chosen to be a very small value such as 1e-10.

It is also worth noting that RMSProp implicitly performs a kind of simulated annealing. Suppose we are heading towards the minimum and want to slow down so as not to overshoot it. RMSProp automatically decreases the size of the gradient steps towards the minimum when the steps become too large (large steps make us prone to overshooting).
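Putting the three equations together, a per-parameter RMSProp step might look roughly like this (a sketch, not any particular library's implementation; the default values follow the text above):

    import numpy as np

    def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-10):
        """One RMSProp update for a single parameter tensor."""
        # Eq. 1: exponential average of the squared gradient
        avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
        # Eq. 2: per-parameter step, scaled by the running average
        step = lr * grad / np.sqrt(avg_sq + eps)
        # Eq. 3: parameter update
        return w - step, avg_sq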

Advantages

  1. RMSProp converges faster than GD or SGD in many settings, especially for GANs, RL, and attention-based networks.
  2. RMSProp may be able to converge to solutions of better quality (e.g. better local minima). This is based on my experience playing with these algorithms for GANs and LSTMs.
  3. Less tuning is needed compared to GD or SGD.

Disadvantages

  1. The learning rate still has to be set manually, because the suggested default value is not appropriate for every task.

6. Adam optimization

Adam, or Adaptive Moment Estimation, combines the heuristics of both momentum and RMSProp. Here are the update equations:

m_t = β1 · m_{t−1} + (1 − β1) · g_t           (Eq. 1)
v_t = β2 · v_{t−1} + (1 − β2) · g_t²          (Eq. 2)
w_{t+1} = w_t − (η / (√v_t + ε)) · m_t        (Eq. 3)

Here, we compute the exponential average of the gradient as well as of the squares of the gradient for each parameter (Eq. 1 and Eq. 2). To decide our learning step, in equation 3 we multiply the learning rate by the average of the gradient (as was the case with momentum) and divide it by the root mean square of the exponential average of the squared gradients (as was the case with RMSProp). Then we apply the update.

The hyperparameter beta1 is generally kept around 0.9 while beta2 is kept around 0.999. Epsilon is generally chosen to be a very small value such as 1e-8.

Adam uses an exponential moving average of the squared gradients to scale the learning rate, rather than the cumulative sum of squared gradients used in Adagrad.

Adam is computationally efficient and has a very small memory requirement.

Adam is widely regarded as one of the best general-purpose optimizers currently available.
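Here is a compact sketch of one Adam step, including the bias correction discussed below (illustrative only, not a drop-in replacement for any library's implementation):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update for a single parameter tensor (t is the step count, starting at 1)."""
        m = beta1 * m + (1 - beta1) * grad          # Eq. 1: first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. 2: second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps) # Eq. 3: parameter update
        return w, m, v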

Advantages

  1. Easy to implement
  2. Quite computationally efficient
  3. Requires little memory space
  4. Good for non-stationary objectives
  5. Works well on problems with noisy or sparse gradients
  6. Works well with large data sets and large parameters
  7. Adam realizes the benefits of both AdaGrad and RMSProp.
  • RMSProp adapts the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance); Adam additionally makes use of the average of the first moment (the mean).
  • Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
  • The moving averages are initialized at zero, and with beta1 and beta2 values close to 1.0 (as recommended) this biases the moment estimates towards zero. The bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.

Disadvantages

There are a few disadvantages: the Adam optimizer tends to converge faster, but other algorithms such as stochastic gradient descent often generalize better. Thus the performance depends on the type of data provided and on the speed/generalization trade-off.

When Adam was first introduced, people got very excited about its power. The paper contained some very optimistic charts, showing huge performance gains in terms of training speed.

Then the Nadam paper presented diagrams that showed even better results.

However, after a while people started noticing that, despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification) state-of-the-art results are still only achieved by applying SGD with momentum. More than that, Wilson et al. [9] showed in their paper ‘The marginal value of adaptive gradient methods in machine learning’ that adaptive methods (such as Adam) do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, discouraging people from using these popular optimization algorithms. A lot of research has since been done to analyze the poor generalization of Adam and to close the gap with SGD.

Nitish Shirish Keskar and Richard Socher, in their paper ‘Improving Generalization Performance by Switching from Adam to SGD’, also showed that by switching to SGD during training they were able to obtain better generalization than when using Adam alone. They proposed a fix based on a very simple idea: they noticed that in the earlier stages of training Adam still outperforms SGD, but later the learning saturates. Their strategy, called SWATS, starts training the deep neural network with Adam and then switches to SGD when a certain criterion is hit. With it they managed to achieve results comparable to SGD with momentum.

7. Learning rate decay

The learning rate or step size in machine learning is a hyperparameter which determines to what extent newly acquired information overrides old information. It is the most important hyperparameter to tune for training deep neural networks. The learning rate is crucial because it controls both the speed of convergence and the ultimate performance of the network. We select the learning rate mostly by trial and error, by virtue of previous experience, or with methods like an LR finder. A learning rate that is too high will make the learning jump over minima, while one that is too low will either take too long to converge or get stuck in an undesirable local minimum.

Finding a decent learning rate for a neural network is like fishing. The selection of learning rate is one of those things that makes deep learning look like magic. One of the simplest learning rate strategies is to have a fixed learning rate throughout the training process. During earlier iterations, faster learning rates lead to faster convergence while during later epochs, slower learning rate produces better accuracy. Changing the learning rate over time can overcome this tradeoff.

Schedules define how the learning rate changes over time and are typically specified per epoch or per iteration (i.e. batch) of training. The main benefits of learning rate schedules are faster convergence and higher accuracy. They differ from adaptive methods (such as AdaDelta and Adam) in that:

  • Schedules change the global learning rate for the optimizer, rather than parameter-wise learning rates.
  • Schedules don’t take feedback from the training process and are specified beforehand.

Types of Schedules

The theory of stochastic approximation gives us many types of schedules, but these are not the ones usually used in contemporary deep learning models and frameworks; the theoretical basis of why the practical schedules work well is an active area of research. Here, we will look at the most commonly used schedules:

  1. Step-wise Decay
  2. Polynomial Decay
  3. Exponential Schedule
  4. Reduce on Loss Plateau Decay
  5. Cosine Annealing
  6. Custom Schedules

1. Step-wise Decay

In step-wise decay, the learning rate is decayed after a fixed number of steps (the interval) by a fixed factor. This fixed factor is called the decay factor, usually represented by γ (gamma).

  • After every s epochs, the learning rate is multiplied by the decay factor γ.
  • Learning rate at epoch t:

η_t = η_0 · γ^floor(t / s)

where η_t is the learning rate at epoch t, γ is the decay rate, and s is the step size (a code sketch follows the tips below).

Tips

  • You would want to decay your LR gradually when you are training for more epochs.
  • If you decay too rapidly, you converge too fast, to a crappy loss/accuracy.
  • To decay more slowly, use a larger γ or a larger interval of decay.
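For example, in PyTorch a step-wise schedule can be set up with the built-in StepLR (a sketch; the step size and γ below are arbitrary):

    import torch

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=0.1)
    # decay the LR by gamma every `step_size` epochs: lr_t = lr_0 * gamma ** floor(t / step_size)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... one epoch of training would go here ...
        optimizer.step()
        scheduler.step()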

2. Polynomial Decay

Step-wise schedules and the discontinuities they introduce may sometimes lead to instability in the optimization, so in some cases smoother schedules are preferred. In polynomial decay, the learning rate is decayed after every epoch based on a polynomial function, providing a smoother decay that reaches a learning rate of 0 after max_update iterations.

The two important quantities in polynomial decay are:-

  1. max_update: the number of iterations over which the learning rate decays to its final value (0).
  2. power: the degree of the polynomial function

Tips

  • Smaller values of power produce slower decay and large values of learning rate for longer periods.
  • For longer training, last_epoch can be increased.
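As a sketch of one common form of this schedule (frameworks differ in the exact formula, so treat the details below as an assumption for illustration):

    def polynomial_decay(epoch, max_update, base_lr, final_lr=0.0, power=2):
        """lr_t = final_lr + (base_lr - final_lr) * (1 - t / max_update) ** power"""
        t = min(epoch, max_update)
        return final_lr + (base_lr - final_lr) * (1 - t / max_update) ** power

    # e.g. base_lr = 0.1 decays smoothly to 0 over max_update = 100 iterations
    print(polynomial_decay(50, 100, 0.1))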

3. Exponential Decay

Like the polynomial decay given above, exponential decay gives a smoother decay, solving the instability issues in step-wise scheduling. But here, the learning-rate is decayed after every epoch based on an exponential function.

The important parameters in exponential decay are:

  • last_epoch: the index of the last epoch
  • γ (gamma): the multiplicative factor of learning rate decay

Tips

  • Larger γ: slower convergence, but better loss/accuracy.
  • Smaller γ: faster convergence, but worse loss/accuracy.
  • For a longer training period, last_epoch can be increased.
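A formula-level sketch of exponential decay (γ here is the per-epoch multiplicative factor; the values are illustrative):

    def exponential_decay(lr0, gamma, epoch):
        """lr_t = lr_0 * gamma ** t  (decay applied once per epoch)"""
        return lr0 * gamma ** epoch

    # e.g. lr0 = 0.1, gamma = 0.95: epoch 0 -> 0.1, epoch 10 -> ~0.0599
    print(exponential_decay(0.1, 0.95, 10))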

4. Reduce on Loss Plateau Decay

All the above decay methods (step-wise, polynomial, exponential) reduce the learning rate according to a pre-defined rule. The change may happen after a few steps or at every step, but it is inevitable. Consider a situation where a learning rate value is performing well: decaying it prematurely may not be a wise idea. Similarly, continuing with a stale learning rate value while waiting for the next decay step is not helpful either. None of these scheduling methods take into account how the loss is actually behaving at the moment.

So a better idea may be to decrease the learning rate only when the loss plateaus. This is exactly what we do in ‘Reduce on Loss Plateau Decay’: the decay occurs only after no improvement in the loss value is found. The plateau condition is checked using a fixed value called patience, which determines the number of epochs to wait before changing the learning rate. For example, if patience = 2, we ignore the first 2 epochs with no improvement and decrease the LR only after the 3rd epoch if the loss still hasn’t improved.

The two important quantities in loss plateau decay are:

  • Patience: number of epochs with no improvements after which learning rate will be reduced.
  • Factor: multiplier to decrease the learning-rate.

Tips

  • For larger number of epochs increase the value of patience
  • Loss or accuracy or any other metric can be used for finding plateau
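In PyTorch this behaviour corresponds to ReduceLROnPlateau; a minimal usage sketch (the patience and factor values are illustrative):

    import torch

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=0.1)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.1, patience=2)

    for epoch in range(20):
        val_loss = 1.0                # placeholder for the epoch's validation loss
        scheduler.step(val_loss)      # LR is reduced only when the loss stops improving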

5. Cosine Annealing

Cosine Annealing was proposed in ‘SGDR: Stochastic Gradient Descent with Warm Restarts’ by Ilya Loshchilov & Frank Hutter. We will only talk about the cosine annealing part here and leave the warm restarts for another time. In cosine annealing, we use the cosine function over the range [0, π]. This is particularly useful because in the early iterations it gives us a relatively large learning rate to quickly approach a local minimum (faster convergence), and towards the end it gives us many small-learning-rate iterations (better loss/accuracy).

Important parameters in cosine annealing are:-

  • min_lr: the minimum learning rate
  • max_lr: the maximum learning rate
  • cycle_length: the number of epochs to run between the maximum and minimum learning rates

Tips

  • Longer cycle length usually works better.
  • Cosine Annealing with warm restarts produces better results than vanilla cosine annealing.
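Between restarts, the annealing itself follows a half-cosine curve; a small sketch of that formula:

    import math

    def cosine_annealing(epoch, cycle_length, max_lr, min_lr):
        """lr_t = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * t / cycle_length))"""
        return min_lr + 0.5 * (max_lr - min_lr) * (
            1 + math.cos(math.pi * epoch / cycle_length))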

6. Custom Schedules

Along with all these common LR scheduling methods, we can make our own schedules. So, let’s make a custom schedule by subclassing PyTorch’s _LRScheduler:

import math
from torch.optim.lr_scheduler import _LRScheduler

class LogAnnealingLR(_LRScheduler):
    def __init__(self, optimizer, T_max, eta_min=0, last_epoch=-1):
        self.T_max = T_max
        self.eta_min = eta_min
        super(LogAnnealingLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        # anneal each base LR between base_lr and eta_min over T_max epochs
        return [self.eta_min + (base_lr - self.eta_min) *
                (1 + math.cos(math.pi * self.last_epoch / self.T_max)) / 2
                for base_lr in self.base_lrs]

BONUS

Which optimizer to use?

So, which optimizer should you use? If your input data is sparse, you are likely to achieve the best results with one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate: you will likely achieve good results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta additionally uses the RMS of the parameter updates in the numerator of its update rule. Adam, finally, adds bias-correction and momentum to RMSprop. To this extent, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. show that Adam’s bias-correction helps it slightly outperform RMSprop towards the end of optimization as gradients become sparser. For these reasons, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than some of the other optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.

I hope the concept of optimizers is clear by now; it’s the beauty of the mathematics, and of playing around with equations, that researchers have spent so much time on. All of these optimizers are now used with mini-batches. Mini-batch gradient descent solved the performance problem and had less noise, while momentum reduced the noise further and brought a smoothing effect. The problem of Adagrad’s diminishing learning rate in deeper neural networks was solved by RMSProp, and Adam, by combining momentum and RMSProp, is one of the best optimizers available right now.

Thanks for reading and happy modelling!

My GITHUB
