Understanding LSTM Architecture
In the architecture of an LSTM cell, which component is primarily responsible for regulating the information that should be discarded from the cell state at each time step, and how does it determine this?
- A. The output gate, using a ReLU activation function to determine which outputs to ignore
- B. The input gate, using a sigmoid activation to decide what to discard
- C. The forget gate, using a sigmoid activation function to produce values between 0 and 1 that selectively remove information from the cell state
- D. The candidate gate, using a tanh activation to remove previous information
- E. The memory gate, using a softmax function to scale the cell state
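
For context, here is a minimal NumPy sketch of how a sigmoid-based gate can scale the previous cell state element-wise. The weight names `W_f`, `U_f`, and `b_f` are illustrative placeholders, not tied to any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, U_f, b_f):
    # Produces values in (0, 1); multiplying the previous cell state by
    # these values discards (near 0) or retains (near 1) each component.
    return sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
```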
Mechanisms Combating Vanishing Gradients
LSTM networks were developed to address the vanishing gradient problem often found in traditional RNNs. What internal mechanism do LSTM layers employ that primarily mitigates this issue and allows them to learn long-term dependencies more effectively?
- A. Adding extra hidden layers with LeakyReLU activations
- B. Utilizing explicit cell states and gating mechanisms to control gradient flow
- C. Applying batch normalization after every time step
- D. Replacing tanh activations with hard sigmoid
- E. Utilizing only output gates to accumulate past information
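
A brief sketch of the cell-state update, using the same illustrative notation as above. Its largely additive form is what lets gradients flow across many time steps without being repeatedly squashed:

```python
def cell_state_update(f_t, i_t, g_t, c_prev):
    # c_t depends on c_prev through element-wise scaling rather than a
    # saturating non-linearity, so the backward path through time is not
    # repeatedly compressed the way a plain RNN's hidden state is.
    return f_t * c_prev + i_t * g_t
```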
Interpreting State Vectors in LSTM
Consider an LSTM cell receiving a sequence of inputs where the hidden state and cell state vectors evolve over time. Which statement most accurately describes the core distinction between these two vectors within the same LSTM cell at a given time step?
- A. The hidden state contains short-term sequential information while the cell state accumulates long-term information
- B. The hidden state stores raw input values, and the cell state applies non-linear transformation to those inputs
- C. Both vectors represent the same data but with different dimensionality
- D. The cell state outputs the final prediction, while the hidden state is only used for memory retention
- E. The cell state is computed by concatenating all previous hidden states
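
One way to see the distinction in code (an illustrative sketch, continuing the notation above): the cell state is carried forward as memory, while the hidden state is a gated, squashed view of it that the cell exposes at each step:

```python
import numpy as np

def expose_hidden(o_t, c_t):
    # The cell state c_t is passed to the next time step as memory;
    # the hidden state h_t is what the cell emits at this step and
    # feeds to any downstream layer.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```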
Technical Details of Gate Operations
Given an LSTM cell performing a sequence update, how is the candidate value for updating the cell state typically generated, and which activation function is used for this purpose?
- A. By applying a tanh activation to a linear transformation of the inputs and previous hidden state
- B. By applying a sigmoid activation directly to the cell state
- C. By using a softmax activation on the previous cell state
- D. By applying a linear transformation to only the inputs, with no activation
- E. By multiplying the previous cell state by the input gate
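
For reference, a sketch of a candidate-value computation under the same assumed notation; `W_g`, `U_g`, and `b_g` are hypothetical weight names:

```python
import numpy as np

def candidate_value(x_t, h_prev, W_g, U_g, b_g):
    # tanh keeps candidate values in (-1, 1) before they are scaled by
    # the input gate and added into the cell state.
    return np.tanh(W_g @ x_t + U_g @ h_prev + b_g)
```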
LSTM Regularization Strategies
When using LSTM networks for practical tasks such as text generation, which method is commonly applied to prevent overfitting during training, and how does it function in this context?
- A. Early stopping, by terminating training once validation loss increases
- B. Dropout, by randomly zeroing out a fraction of hidden units at each training step
- C. Weight tying, by sharing weights between all gates
- D. Gradient clipping, by limiting the maximum value of gradients
- E. Max pooling, by pooling hidden states at each time step
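
A minimal sketch of inverted dropout applied to a hidden-state vector during training, in plain NumPy rather than any framework's built-in dropout API:

```python
import numpy as np

def dropout(h, rate=0.3, training=True, rng=None):
    # During training, randomly zero a fraction `rate` of units and
    # rescale the survivors so the expected activation is unchanged;
    # at inference time the hidden state passes through untouched.
    if not training or rate == 0.0:
        return h
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= rate).astype(h.dtype)
    return h * mask / (1.0 - rate)
```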