Transformers: Architecture & Core Concepts Quiz

Explore fundamental concepts and architectural details of transformer neural networks with this easy quiz. Enhance your understanding of self-attention, positional encoding, and key transformer components essential for modern deep learning applications.

  1. Core Mechanism

    Which mechanism allows transformers to weigh the importance of different words in a sentence when creating word representations?

    1. Self-attention
    2. Convolution
    3. Recurrence
    4. Pooling

    Explanation: Self-attention enables the model to focus on relevant words by computing attention scores for each word pair in a sequence. Pooling is used for combining features but does not weigh relationships between words. Convolution focuses on local patterns, not global relationships. Recurrence is central to recurrent networks rather than transformer models.
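
    As a minimal sketch of the idea (NumPy, with a toy 3-token sequence and made-up sizes; the projection weights here are random, whereas a trained model learns them), the resulting matrix shows how much weight each word gives every other word:

      import numpy as np

      rng = np.random.default_rng(0)
      seq_len, d_model = 3, 8                    # toy sizes: 3 tokens, 8-dim vectors
      x = rng.normal(size=(seq_len, d_model))    # stand-in word representations

      # Project tokens into query, key, and value spaces (learned in practice).
      W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
      Q, K, V = x @ W_q, x @ W_k, x @ W_v

      # Score every (query, key) word pair, then normalize each row with softmax.
      scores = Q @ K.T / np.sqrt(d_model)
      weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

      output = weights @ V        # each row is a context-weighted mix of values
      print(weights.round(2))     # each row sums to 1: that word's attention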

  2. Layer Structure

    In a transformer model, which two main sub-layers are found within each encoder and decoder block?

    1. Linear and Dropout
    2. Softmax and Embedding
    3. Pooling and Normalization
    4. Self-attention and Feedforward

    Explanation: Transformers use a self-attention mechanism followed by a position-wise feedforward layer within each encoder and decoder block. Linear and dropout are operations, not major sub-layers. Softmax and embedding are used at other stages but are not block sub-layers. Pooling belongs to other architectures, and normalization, while present in transformer blocks, is a supporting operation rather than one of the two main sub-layers.
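
    A minimal NumPy sketch of the two sub-layers with their residual connections (layer normalization and multi-head splitting are omitted for brevity; every name and size below is illustrative):

      import numpy as np

      def softmax(z):
          e = np.exp(z - z.max(axis=-1, keepdims=True))
          return e / e.sum(axis=-1, keepdims=True)

      def encoder_block(x, W_q, W_k, W_v, W1, b1, W2, b2):
          # Sub-layer 1: self-attention over the whole sequence.
          Q, K, V = x @ W_q, x @ W_k, x @ W_v
          attn = softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V
          x = x + attn                               # residual connection

          # Sub-layer 2: position-wise feedforward (same MLP at every position).
          ff = np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU MLP
          return x + ff                              # residual connection

      rng = np.random.default_rng(0)
      d, h = 8, 16
      params = [rng.normal(size=s) for s in
                [(d, d), (d, d), (d, d), (d, h), (h,), (h, d), (d,)]]
      print(encoder_block(rng.normal(size=(3, d)), *params).shape)  # (3, 8)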

  3. Positional Awareness

    How do transformers incorporate information about the order of words, given that they process sequences in parallel?

    1. By sequential updates
    2. By using convolutional filters
    3. By adding positional encodings
    4. By inserting extra tokens

    Explanation: Positional encodings are added to input embeddings to give the transformer information about word order. Convolutional filters operate differently and are not used for order in transformers. Inserting extra tokens is not a standard method for order. Sequential updates belong to models that process data in sequence, not in parallel.
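
    A sketch of the sinusoidal scheme from the original 'Attention Is All You Need' paper (one common choice; learned position embeddings are another):

      import numpy as np

      def positional_encoding(seq_len, d_model):
          # Each position gets a unique pattern of sines and cosines at
          # geometrically spaced frequencies, from which order can be inferred.
          pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
          i = np.arange(d_model // 2)[None, :]           # (1, d_model // 2)
          angles = pos / np.power(10000, 2 * i / d_model)
          pe = np.zeros((seq_len, d_model))
          pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
          pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
          return pe

      embeddings = np.zeros((4, 8))                      # stand-in word embeddings
      x = embeddings + positional_encoding(4, 8)         # added, not concatenated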

  4. Input Representation

    What is typically the first step to prepare text input for a transformer model?

    1. Combining outputs of multiple layers
    2. Embedding words into continuous vectors
    3. Applying activation functions
    4. Splitting words into tri-grams only

    Explanation: Words are embedded into continuous vector spaces to serve as input to the transformer. Activation functions are applied later in the network, not as a preprocessing step. Combining outputs happens deeper in the model, not with raw input. While tokenization is important, splitting into tri-grams exclusively is not the standard approach.
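
    A toy illustration of that first step (the vocabulary and IDs below are made up; real models run a learned subword tokenizer first, then look up learned embeddings):

      import numpy as np

      vocab = {"the": 0, "cat": 1, "sat": 2}       # hypothetical tiny vocabulary
      rng = np.random.default_rng(0)
      embedding_table = rng.normal(size=(len(vocab), 8))  # learned in practice

      token_ids = [vocab[w] for w in "the cat sat".split()]
      x = embedding_table[token_ids]               # (3, 8): one vector per token
      print(x.shape)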

  5. Multi-head Concept

    What is the primary benefit of using multi-head self-attention instead of a single attention head?

    1. Increasing model bias
    2. Reducing the number of parameters
    3. Simplifying computations
    4. Capturing diverse relationships in parallel

    Explanation: Multi-head self-attention enables the model to learn different types of relationships simultaneously. It does not reduce the number of parameters; with the model dimension held fixed, the count is comparable to a single full-width head. Increasing model bias is neither a benefit nor an intended outcome. Multi-head attention makes computations richer, not simpler.
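
    A sketch of how this is usually implemented: the model dimension is split across heads, each head attends independently in its own subspace, and the outputs are concatenated, so the total cost stays comparable to a single full-width head (sizes here are illustrative):

      import numpy as np

      def softmax(z):
          e = np.exp(z - z.max(axis=-1, keepdims=True))
          return e / e.sum(axis=-1, keepdims=True)

      seq_len, d_model, n_heads = 3, 8, 2
      d_head = d_model // n_heads
      rng = np.random.default_rng(0)
      Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))

      # Reshape (seq, d_model) -> (heads, seq, d_head): each head sees a slice.
      split = lambda m: m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
      Qh, Kh, Vh = split(Q), split(K), split(V)

      # Every head computes attention on its own, so each can specialize.
      weights = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))
      heads = weights @ Vh                               # (heads, seq, d_head)

      # Concatenate head outputs back to the model dimension.
      out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)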

  6. Encoder Role

    Which of the following best describes the main function of a transformer's encoder stack?

    1. To decode output sequences
    2. To produce context-aware representations of input tokens
    3. To perform output activation
    4. To generate final predictions directly

    Explanation: The encoder stack transforms input tokens into representations that capture their context within the sequence. It does not generate final predictions; that is typically the job of the decoder or output layer. Decoding output sequences is specific to the decoder stack, not the encoder. Output activation is a final operation applied at the prediction stage, not a function of the encoder.
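
    Schematically, the encoder is just a stack of identical blocks, each refining the output of the previous one (a trivial stand-in block below; see the sketch under question 2 for what a real block contains):

      import numpy as np

      def encoder_block(x):
          # Stand-in for self-attention + feedforward with residuals.
          return x + np.tanh(x)          # any shape-preserving transform

      x = np.ones((3, 8))                # embeddings + positional encodings
      for _ in range(6):                 # the original paper stacks 6 blocks
          x = encoder_block(x)           # each pass mixes in more context
      # x now holds context-aware representations, one per input token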

  7. Decoder Masking

    Why is masking used in the decoder during transformer training for sequence tasks like text generation?

    1. To limit memory usage
    2. To encode word positions automatically
    3. To prevent the model from 'seeing' future tokens
    4. To drop random words for regularization

    Explanation: Masking ensures the decoder only attends to previous or current tokens, preventing it from accessing future information, which is crucial during generation. Dropping words is a regularization technique, but not what masking in the decoder is for. Memory usage is not directly managed by decoder masking. Encodings for word positions are handled separately with positional encoding.
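
    A sketch of the standard causal (look-ahead) mask: future positions get negative infinity before the softmax, which turns their attention weights into exact zeros:

      import numpy as np

      def softmax(z):
          e = np.exp(z - z.max(axis=-1, keepdims=True))
          return e / e.sum(axis=-1, keepdims=True)

      seq_len = 4
      scores = np.zeros((seq_len, seq_len))    # stand-in attention scores

      # Strictly-upper-triangular entries are future positions; mask them out.
      mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
      weights = softmax(scores + mask)
      print(weights.round(2))
      # Row i spreads its weight only over tokens 0..i -- no peeking ahead.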

  8. Normalization Purpose

    What is the reason for applying layer normalization after key sub-layers in transformer blocks?

    1. To stabilize and accelerate training
    2. To act as a pooling operation
    3. To randomize output embeddings
    4. To decrease parameter size

    Explanation: Layer normalization helps make training more stable and faster by normalizing activations. It does not decrease the number of parameters in the model. Randomizing embeddings is not a function of normalization. Pooling operations summarize inputs, which is not what layer normalization performs.
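
    A minimal version of the operation: each token's feature vector is rescaled to zero mean and unit variance, then shifted and scaled by learned parameters (gamma and beta below are stand-ins):

      import numpy as np

      def layer_norm(x, gamma, beta, eps=1e-5):
          # Normalize across the feature dimension of each token independently.
          mean = x.mean(axis=-1, keepdims=True)
          var = x.var(axis=-1, keepdims=True)
          return gamma * (x - mean) / np.sqrt(var + eps) + beta

      x = np.array([[1.0, 2.0, 300.0, 4.0]])   # one token with a wild activation
      y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
      print(y.round(2))   # values brought to a comparable scale (mean 0, var 1)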

  9. Attention Weights

    In transformer self-attention, how are the importance weights for different input tokens typically computed?

    1. By calculating dot products between query and key vectors
    2. By assigning random weights
    3. By summing the embeddings directly
    4. By passing embeddings through convolution

    Explanation: The self-attention mechanism uses dot products between query and key vectors to compute attention scores. Summing embeddings would not provide meaningful importance weights. Convolution is not involved in computing attention directly. Assigning random weights is not a valid approach in this context.
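
    In the notation of the original paper, the full computation is the "scaled dot-product attention":

      \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]

    Here Q, K, and V are the query, key, and value matrices and d_k is the key dimension; dividing by the square root of d_k keeps the dot products from growing with dimension and saturating the softmax.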

  10. Output Layer

    What is commonly used as the last layer in transformer models for language prediction tasks?

    1. A linear layer followed by softmax
    2. A convolution layer
    3. An embedding layer
    4. A normalization layer

    Explanation: The output of a transformer is typically passed through a linear layer and then a softmax function for predicting probabilities over the vocabulary. Convolution is not standard at the output stage. The embedding layer is used at the input, not for final predictions. Normalization layers are used earlier to stabilize training, not for output computation.
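
    A sketch of that final step for a single position (the vocabulary size and weights below are placeholders; in practice the projection is often weight-tied to the input embedding matrix):

      import numpy as np

      rng = np.random.default_rng(0)
      d_model, vocab_size = 8, 100
      hidden = rng.normal(size=(d_model,))      # decoder output at one position

      # Linear projection to one logit per vocabulary word, then softmax.
      W_out = rng.normal(size=(d_model, vocab_size))
      logits = hidden @ W_out
      probs = np.exp(logits - logits.max())
      probs /= probs.sum()
      print(probs.argmax(), round(probs.sum(), 3))  # predicted id; probs sum to 1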