Stochastic, Batch and Mini Batch Gradient Descent

Tapas Mahanta
Mar 22, 2020

Linear regression cost function:
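The usual mean-squared-error cost for linear regression with hypothesis h_θ(x) = θᵀx over m training examples, written out here for reference (the 1/2m scaling is a common convention and is assumed here):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```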

Batch Gradient Descent

In Batch Gradient Descent, all the training data is taken into consideration to take a single step. We take the average of the gradients of all the training examples and then use that mean gradient to update our parameters. So that’s just one step of gradient descent in one epoch/iteration.
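As a rough sketch (assuming a NumPy feature matrix `X`, a target vector `y`, and the linear-regression mean-squared-error cost above; the function name and hyperparameters are illustrative, not the author's code), one epoch of batch gradient descent performs exactly one parameter update:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Batch gradient descent for linear regression with an MSE cost.

    X is an (m, n) feature matrix, y an (m,) target vector.
    Exactly one parameter update per epoch, using the mean gradient
    over all m training examples.
    """
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(epochs):
        predictions = X @ theta                    # predictions for every example
        errors = predictions - y
        costs.append((errors @ errors) / (2 * m))  # MSE cost recorded once per epoch
        grad = (X.T @ errors) / m                  # average gradient over the whole dataset
        theta -= lr * grad                         # a single step per epoch
    return theta, costs
```

Plotting the returned `costs` list gives the smooth cost-vs-epochs curve discussed below.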

Because batch gradient descent computes the gradient over the whole dataset, it works well for convex or relatively smooth error surfaces: each step moves fairly directly towards an optimum, either local or global.

Cost vs Epochs (Source: https://www.bogotobogo.com/python/scikit-learn/scikit-learn_batch-gradient-descent-versus-stochastic-gradient-descent.php)

The graph of cost vs. epochs is also quite smooth because we average the gradients of all the training examples for each step, so the cost keeps decreasing steadily over the epochs.

Stochastic Gradient Descent

In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step. We do the following in one epoch of SGD:

1. Take a single training example.
2. Compute the gradient of the cost using only that example.
3. Use that gradient to update the parameters.
4. Repeat steps 1–3 for every example in the training set.
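A minimal sketch of this per-example loop, under the same linear-regression / NumPy assumptions as the batch version above (names and hyperparameters are illustrative, not the author's code):

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=10):
    """SGD for linear regression: one parameter update per training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):   # shuffle, then visit each example once
            error = X[i] @ theta - y[i]      # prediction error for a single example
            grad = error * X[i]              # gradient computed from that one example only
            theta -= lr * grad               # update immediately: m updates per epoch
    return theta
```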

Because gradients from single samples are noisy, the cost in SGD fluctuates over the training examples and does not necessarily decrease at every step. Over many epochs it trends downward, but because of the noise it typically never settles exactly at the minimum; instead it keeps oscillating around it.

SGD works better than batch gradient descent for error surfaces with many local minima, because the noisier gradient tends to jerk the model out of shallow local minima into a region that is hopefully closer to a better optimum.

The amount of jerking is reduced when using mini-batches. A good balance is struck when the mini-batch size is small enough to escape some of the poor local minima, yet large enough that the updates do not skip past the global minimum.

Batch gradient descent is a good choice for smooth, convex cost surfaces; SGD is a better choice when the dataset is large.

Mini-Batch Gradient Descent

Here we use neither the whole dataset at once nor a single example at a time. Instead, we use a batch of a fixed number of training examples, smaller than the full dataset, called a mini-batch. Doing this gives us the advantages of both of the variants we saw above.

Mini-batch gradient descent uses n data points (instead of 1 sample in SGD) at each iteration.
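A sketch of the mini-batch loop under the same linear-regression / NumPy assumptions, where `batch_size` plays the role of n (function name and defaults are illustrative):

```python
import numpy as np

def mini_batch_gradient_descent(X, y, lr=0.01, epochs=10, batch_size=32):
    """Mini-batch gradient descent: one update per batch of `batch_size` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]      # indices of the current mini-batch
            errors = X[batch] @ theta - y[batch]
            grad = (X[batch].T @ errors) / len(batch)  # average gradient over the mini-batch
            theta -= lr * grad                         # one update per mini-batch
    return theta
```

With `batch_size=1` this reduces to SGD, and with `batch_size=m` it reduces to batch gradient descent.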
