Assess your understanding of gradient descent and optimization algorithms with questions covering core concepts, common variants, and essential terminology. Great for learners aiming to build a solid foundation in machine learning optimization techniques.
What is the primary goal of applying the gradient descent algorithm when training a model?
Explanation: The central aim of gradient descent is to minimize the loss function by iteratively updating model parameters in the direction that reduces error. Maximizing the learning rate is not the goal, as this can make the algorithm unstable. Reducing the number of features is a matter of feature selection, not optimization. Eliminating bias in predictions concerns model design rather than the specific function of gradient descent.
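The core update rule can be sketched in a few lines. This is a minimal illustration, assuming a simple quadratic loss f(w) = (w - 3)^2 chosen purely for demonstration:

```python
# Minimal gradient descent sketch on the assumed loss f(w) = (w - 3)^2.
def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter guess
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)  # step opposite the gradient to reduce the loss
# w ends up very close to the minimizer w = 3
```

Each iteration moves `w` a small step in the direction that lowers the loss, which is exactly the "minimize the loss function" behavior described above.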
In the context of gradient descent, what does the 'learning rate' control during the optimization process?
Explanation: The learning rate determines how large or small each parameter update will be when moving towards the minimum of the loss function. It does not control the number of iterations, the model type, or the quantity of data. Too high a learning rate can overshoot minima, while too low can slow down convergence.
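The effect of the learning rate on step size can be seen directly by running the same problem with two different rates. This sketch reuses an assumed quadratic loss (w - 3)^2 for illustration:

```python
# Sketch: identical quadratic loss, two learning rates, same step budget.
def grad(w):
    return 2.0 * (w - 3.0)

def run(lr, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_small = run(0.01)  # tiny steps: still far from the minimum after 50 steps
w_good = run(0.1)    # moderate steps: essentially converged
```

The small rate makes slow progress toward the minimum at 3, while the moderate rate converges within the same number of iterations, matching the trade-off described above.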
Which of the following best describes batch gradient descent when training a linear regression model?
Explanation: Batch gradient descent computes the gradient of the loss with respect to the parameters using the whole dataset before making a single update. Updating after each data point describes stochastic gradient descent. Updating randomly is not standard practice. Updating only at the end of training would mean no learning occurs during training.
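A sketch of batch gradient descent for least-squares linear regression makes the "one update per full pass" behavior concrete. The tiny dataset (y = 2x) is assumed purely for illustration:

```python
# Batch gradient descent sketch for least-squares linear regression.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so the best slope is w = 2

w, lr = 0.0, 0.05
for _ in range(200):
    # One update per pass: gradient of mean squared error,
    # averaged over the WHOLE dataset.
    g = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * g
```

Note that `g` sums over every example before `w` changes at all, which is the defining trait of the batch variant.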
How does stochastic gradient descent (SGD) differ from batch gradient descent in terms of update frequency?
Explanation: SGD updates the model's parameters for each individual data point, making the process quicker and more dynamic but less stable per step. Batch gradient descent updates only after seeing the entire dataset. Claiming that SGD never updates, or that it updates only once per epoch, is incorrect; neither reflects how SGD actually operates.
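The difference in update frequency shows up clearly in code. This sketch runs SGD on the same assumed least-squares problem, updating after every single example rather than once per pass:

```python
# SGD sketch: parameters are updated after EVERY individual example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # y = 2x, best slope w = 2

w, lr = 0.0, 0.02
for _ in range(100):             # 100 epochs
    for x, y in zip(xs, ys):
        g = 2 * (w * x - y) * x  # gradient from ONE data point
        w -= lr * g              # immediate update: 4 updates per epoch
```

Where batch gradient descent would perform 100 updates here (one per epoch), SGD performs 400, which is exactly the higher update frequency the explanation describes. In practice the data order is usually shuffled each epoch to reduce bias from a fixed ordering.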
Which optimization technique updates parameters after processing a small subset of the training data at each step?
Explanation: Mini-batch gradient descent combines aspects of both batch and stochastic methods, updating parameters after computing gradients over small data subsets. Coordinate descent optimizes one parameter (coordinate) at a time, and Newton's method is a second-order optimization method. Batch gradient descent uses the whole dataset for each update.
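The mini-batch variant sits between the two previous extremes. This sketch, on the same assumed toy regression problem, uses batches of two examples per update:

```python
# Mini-batch gradient descent sketch: each update averages the gradient
# over a small subset (batch of 2) rather than one point or the full set.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # y = 2x, best slope w = 2
batch = 2

w, lr = 0.0, 0.02
for _ in range(200):
    for i in range(0, len(xs), batch):
        bx, by = xs[i:i + batch], ys[i:i + batch]
        g = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
        w -= lr * g  # one update per mini-batch
```

With 4 examples and batches of 2, each epoch yields 2 updates: more frequent than batch gradient descent, less noisy than pure SGD.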
What does it mean if an optimization algorithm converges to a local minimum during training?
Explanation: A local minimum is a point where the loss function value is lower than at adjacent points, though there might be a lower global minimum elsewhere. Global minimum refers to the absolute lowest possible value, which isn't guaranteed. Achieving zero error is unrelated to the definition of a local minimum, and stopping without progress does not define any minimum.
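Which minimum gradient descent reaches depends on where it starts. This sketch uses an assumed nonconvex function f(w) = w^4 - 2w^2 + 0.5w, which has a local minimum near w ≈ 0.9 and a lower, global minimum near w ≈ -1.1:

```python
# Sketch: gradient descent on a nonconvex loss with two minima.
def f(w):
    return w**4 - 2 * w**2 + 0.5 * w

def grad(w):
    return 4 * w**3 - 4 * w + 0.5

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_right = descend(2.0)   # starts in the right basin -> local minimum
w_left = descend(-2.0)   # starts in the left basin  -> global minimum
```

Both runs converge (the gradient goes to zero), but only the left start finds the lower basin; the right start is stuck at a point lower than its neighbors yet higher than the global minimum, exactly the situation described above.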
What is a likely consequence of setting the learning rate too high in gradient descent optimization?
Explanation: A too-high learning rate can cause the optimization to skip past minima, resulting in divergence or oscillation. Underfitting is usually caused by an overly simple model, not the learning rate. A very low learning rate leads to slow loss reduction. The training data size is unrelated to learning rate choices.
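Divergence from an oversized learning rate is easy to demonstrate. On the assumed loss f(w) = w^2, any rate above 1.0 makes each update overshoot zero by more than the previous distance:

```python
# Sketch: on f(w) = w^2 the update is w <- w - lr * 2w = (1 - 2*lr) * w,
# so |1 - 2*lr| > 1 (here lr > 1.0) means |w| GROWS every step.
def step(w, lr, n=20):
    for _ in range(n):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

w_ok = step(5.0, 0.1)   # shrinks steadily toward the minimum at 0
w_bad = step(5.0, 1.1)  # overshoots further each step: divergence
```

After 20 steps the moderate rate has nearly reached the minimum, while the high rate has blown up far past its starting point.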
Why is 'momentum' often added to basic gradient descent algorithms?
Explanation: Momentum allows the optimization to build speed in beneficial directions and smooth out erratic updates, improving convergence. It doesn't reduce data size or shuffle gradients, both of which are unrelated to the algorithm's purpose. Preventing parameter updates would defeat the purpose of optimization entirely.
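The classical "heavy ball" form of momentum keeps a velocity term that accumulates past gradients. A minimal sketch, again on an assumed quadratic loss (w - 3)^2:

```python
# Gradient descent with momentum (classical heavy-ball form).
def grad(w):
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
lr, beta = 0.05, 0.9  # beta is the momentum coefficient
for _ in range(200):
    v = beta * v + grad(w)  # exponentially decaying sum of past gradients
    w -= lr * v             # step along the accumulated direction
```

Because `v` carries over between iterations, consistent gradient directions reinforce each other (building speed) while oscillating components partially cancel (smoothing), which is the benefit the explanation describes.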
Which algorithm adjusts the learning rate for each parameter automatically during training?
Explanation: Adagrad and similar adaptive algorithms adjust each parameter's effective learning rate during training based on its past gradients, improving performance on diverse datasets. Standard gradient descent uses a single fixed rate for all parameters. Options 'Simple subtraction algorithm' and 'Global Search' are not standard optimization methods.
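The Adagrad update can be sketched for a single parameter; the assumed quadratic loss (w - 3)^2 is again used only for illustration. Each step is scaled by the inverse square root of the accumulated squared gradients, so frequently large gradients earn smaller effective rates:

```python
# Adagrad update sketch for one parameter on the assumed loss (w - 3)^2.
import math

def grad(w):
    return 2.0 * (w - 3.0)

w, acc = 0.0, 0.0
lr, eps = 0.5, 1e-8  # eps avoids division by zero early on
for _ in range(500):
    g = grad(w)
    acc += g * g                          # running sum of squared gradients
    w -= lr * g / (math.sqrt(acc) + eps)  # per-parameter adaptive step
```

With multiple parameters, each one keeps its own accumulator `acc`, which is what makes the learning rate adaptation per-parameter rather than global.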
In machine learning optimization, what does the 'gradient' of the loss function represent?
Explanation: The gradient is made up of partial derivatives, showing how the loss changes as each parameter changes, guiding updates during optimization. The sum of loss values represents overall error, not directional change. Epoch count and accuracy are unrelated to the mathematical meaning of a gradient.
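The "vector of partial derivatives" idea can be verified numerically with finite differences. This sketch assumes a toy loss L(w0, w1) = w0^2 + 3*w1, whose analytic partials are 2*w0 and 3:

```python
# Sketch: the gradient collects the partial derivative of the loss with
# respect to EACH parameter. Central finite differences approximate them.
def loss(w0, w1):
    return w0**2 + 3 * w1

def numeric_grad(w0, w1, h=1e-6):
    d0 = (loss(w0 + h, w1) - loss(w0 - h, w1)) / (2 * h)  # dL/dw0
    d1 = (loss(w0, w1 + h) - loss(w0, w1 - h)) / (2 * h)  # dL/dw1
    return d0, d1

d0, d1 = numeric_grad(2.0, 5.0)  # analytic values: 2*2.0 = 4.0 and 3.0
```

Each component tells the optimizer how the loss responds to nudging one parameter while holding the others fixed, which is precisely the directional information gradient-based updates rely on.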