Explore fundamental concepts and architectural details of transformer neural networks with this easy quiz. Enhance your understanding of self-attention, positional encoding, and the key transformer components essential for modern deep learning applications.
Which mechanism allows transformers to weigh the importance of different words in a sentence when creating word representations?
Explanation: Self-attention enables the model to focus on relevant words by computing attention scores for each word pair in a sequence. Pooling is used for combining features but does not weigh relationships between words. Convolution focuses on local patterns, not global relationships. Recurrence is central to recurrent networks rather than transformer models.
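As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product self-attention; the matrices Q, K, and V stand in for query, key, and value projections of the token embeddings, and all shapes and values are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Score every query against every key (dot products), scale by sqrt(d_k),
    # turn the scores into weights with softmax, then blend the value vectors.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    weights = softmax(scores)              # each row sums to 1
    return weights @ V                     # weighted sum of value vectors

# Toy example: 3 tokens, 4-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(self_attention(Q, K, V).shape)       # (3, 4)
```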
In a transformer model, which two main sub-layers are found within each encoder and decoder block?
Explanation: Each encoder block pairs a self-attention mechanism with a position-wise feed-forward layer; decoder blocks contain these same two sub-layers plus an additional encoder-decoder (cross-)attention sub-layer. Linear and dropout are operations, not major sub-layers. Softmax and embedding are used at different stages but are not block sub-layers. Pooling belongs to other architectures, and normalization, while present in transformers, wraps the sub-layers rather than being one of them.
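A simplified sketch of the two sub-layers in an encoder block, assuming a placeholder attn_fn that maps the token matrix to attended outputs of the same shape (for example the self-attention sketch above) and illustrative feed-forward weights W1, b1, W2, b2:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: the same two-layer MLP is applied
    # independently to every token's vector (ReLU in between).
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, attn_fn, W1, b1, W2, b2):
    # Sub-layer 1: self-attention, with a residual connection.
    x = x + attn_fn(x)
    # Sub-layer 2: position-wise feed-forward, again with a residual.
    x = x + feed_forward(x, W1, b1, W2, b2)
    # (Layer normalization around each sub-layer is omitted here; see the
    # layer-normalization question further down.)
    return x
```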
How do transformers incorporate information about the order of words, given that they process sequences in parallel?
Explanation: Positional encodings are added to the input embeddings to give the transformer information about word order. Convolutional filters capture local patterns and are not how transformers encode order. Inserting extra tokens is not a standard way to represent order. Sequential updates belong to models that process data step by step, not in parallel.
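A sketch of the sinusoidal positional encodings used in the original Transformer, added elementwise to the token embeddings (d_model is assumed even in this toy version, and the dimensions are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the embeddings, so order information travels
# with every token even though attention itself is order-agnostic.
embeddings = np.random.randn(10, 16)               # 10 tokens, d_model = 16
inputs = embeddings + positional_encoding(10, 16)
```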
What is typically the first step to prepare text input for a transformer model?
Explanation: Text is first tokenized, and each token is mapped to a vector in a continuous embedding space that serves as input to the transformer. Activation functions are applied later in the network, not as a preprocessing step. Combining outputs happens deeper in the model, not with raw input. While tokenization is important, splitting text exclusively into tri-grams is not the standard approach.
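A toy illustration of this first step, using a hypothetical whitespace tokenizer and a small random embedding table; real systems use learned embeddings and subword tokenizers such as BPE:

```python
import numpy as np

# Hypothetical toy vocabulary and a random embedding table (real models learn
# the table and use subword tokenization).
vocab = {"<unk>": 0, "transformers": 1, "use": 2, "attention": 3}
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)

def embed(text):
    # Tokenize (naively, by whitespace) and look up each token's vector.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return embedding_table[ids]                    # (num_tokens, d_model)

x = embed("Transformers use attention")
print(x.shape)                                     # (3, 8)
```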
What is the primary benefit of using multi-head self-attention instead of a single attention head?
Explanation: Multi-head self-attention enables the model to learn different types of relationships simultaneously. It does not reduce the number of parameters; in the standard design each head operates in a lower-dimensional subspace, so the total count is roughly the same as a comparable single-head layer. Increasing model bias is not a benefit nor an intended outcome. Multi-head attention makes computations richer but not necessarily simpler.
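A sketch of how multi-head attention splits the model dimension into several heads, each attending in its own subspace, before concatenating the results; the learned per-head and output projections of the standard formulation are omitted for brevity, and all names and shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention on one head's slice.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, num_heads):
    # Each head attends within its own lower-dimensional slice of the model
    # dimension, so different heads can capture different relationships.
    d_head = Q.shape[-1] // num_heads
    heads = [attention(Q[:, h * d_head:(h + 1) * d_head],
                       K[:, h * d_head:(h + 1) * d_head],
                       V[:, h * d_head:(h + 1) * d_head])
             for h in range(num_heads)]
    return np.concatenate(heads, axis=-1)          # (seq_len, d_model)
```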
Which of the following best describes the main function of a transformer's encoder stack?
Explanation: The encoder stack transforms input tokens into representations that capture their context within the sequence. It does not generate final predictions; that is typically the job of the decoder or output layer. Decoding output sequences is specific to the decoder stack, not the encoder. Applying an output activation such as softmax is a final operation on the model's logits, not the encoder's role.
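For reference, PyTorch's built-in modules expose exactly this stacked structure; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

# A 6-layer encoder stack; each layer contains self-attention + feed-forward.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(1, 10, 512)   # (batch, seq_len, d_model) token embeddings
contextual = encoder(tokens)       # same shape: contextualized representations
print(contextual.shape)            # torch.Size([1, 10, 512])
```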
Why is masking used in the decoder during transformer training for sequence tasks like text generation?
Explanation: Masking ensures the decoder only attends to previous or current tokens, preventing it from accessing future information, which is crucial during generation. Dropping words is a regularization technique, but not what masking in the decoder is for. Memory usage is not directly managed by decoder masking. Encodings for word positions are handled separately with positional encoding.
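A NumPy sketch of causal (look-ahead) masking applied to the attention scores, with illustrative shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    # Causal mask: position i may only attend to positions <= i, so the model
    # cannot peek at future tokens while it learns to generate them.
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)   # future scores -> ~zero weight
    return softmax(scores) @ V

Q = K = V = np.random.randn(4, 8)
out = masked_self_attention(Q, K, V)          # token i never "sees" token i+1, ...
```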
What is the reason for applying layer normalization after key sub-layers in transformer blocks?
Explanation: Layer normalization helps make training more stable and faster by normalizing activations. It does not decrease the number of parameters in the model. Randomizing embeddings is not a function of normalization. Pooling operations summarize inputs, which is not what layer normalization does.
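A minimal NumPy sketch of layer normalization with learnable scale and shift parameters (gamma and beta are shown as plain arrays here purely for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance,
    # then rescale and shift with the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(5, 16)                    # 5 tokens, 16 features each
out = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
# Inside a transformer block this wraps each sub-layer's residual output:
#   x = layer_norm(x + sublayer(x), gamma, beta)
```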
In transformer self-attention, how are the importance weights for different input tokens typically computed?
Explanation: The self-attention mechanism uses dot products between query and key vectors to compute attention scores, which are then scaled and passed through a softmax to produce the importance weights. Summing embeddings would not provide meaningful importance weights. Convolution is not involved in computing attention directly. Assigning random weights is not a valid approach in this context.
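A tiny worked example with made-up numbers, showing how query-key dot products become attention weights:

```python
import numpy as np

q  = np.array([1.0, 0.0])              # query vector for one token
k1 = np.array([1.0, 0.0])              # key of token 1 (points the same way as q)
k2 = np.array([0.0, 1.0])              # key of token 2 (orthogonal to q)

d_k = q.shape[0]
scores = np.array([q @ k1, q @ k2]) / np.sqrt(d_k)   # [0.707, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()      # softmax -> roughly [0.67, 0.33]
print(weights)                          # token 1 gets the larger importance weight
```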
What is commonly used as the last layer in transformer models for language prediction tasks?
Explanation: The output of a transformer is typically passed through a linear layer and then a softmax function for predicting probabilities over the vocabulary. Convolution is not standard at the output stage. The embedding layer is used at the input, not for final predictions. Normalization layers are used earlier to stabilize training, not for output computation.
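A NumPy sketch of this output head: a linear projection to vocabulary size followed by a softmax, with illustrative dimensions and random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 16, 100
W = np.random.randn(d_model, vocab_size)   # final linear projection
b = np.zeros(vocab_size)

hidden = np.random.randn(1, d_model)       # transformer output for one position
logits = hidden @ W + b                    # one score per vocabulary entry
probs = softmax(logits)                    # probabilities over the vocabulary
next_token_id = probs.argmax(axis=-1)      # e.g. greedy prediction
```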