Deep Q-Networks (DQN) Fundamentals Quiz

Explore the core concepts and essential terms behind Deep Q-Networks (DQN), a foundational algorithm in deep reinforcement learning. This quiz covers core DQN principles such as experience replay, target networks, and action selection, with practical examples to reinforce your understanding.

  1. Purpose of DQN

    What is the main purpose of using a Deep Q-Network in reinforcement learning tasks such as playing video games or robotic control?

    1. To classify images into categories
    2. To approximate the optimal action-value function using a neural network
    3. To generate random actions for agents
    4. To optimize supervised learning loss functions

    Explanation: A Deep Q-Network is specifically designed to approximate the optimal action-value function, allowing the agent to estimate the expected return of taking a given action in a particular state. Generating random actions does not take advantage of learning from past experiences. DQNs are not used for image classification or optimizing supervised learning losses, which are tasks found in other machine learning domains.
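
    As a concrete illustration, here is a minimal sketch of such a network, assuming PyTorch; the class name QNetwork, the hidden-layer size, and the cart-pole-like dimensions (4 state features, 2 actions) are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one estimated Q-value per discrete action."""

        def __init__(self, state_dim: int, num_actions: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64),    # state features in
                nn.ReLU(),
                nn.Linear(64, num_actions),  # one Q-value estimate per action out
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    q_net = QNetwork(state_dim=4, num_actions=2)   # cart-pole-sized example
    q_values = q_net(torch.zeros(1, 4))            # shape (1, 2): a value per action
    ```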

  2. Experience Replay

    Why is experience replay used in Deep Q-Networks for tasks like balancing a cart-pole system?

    1. To force the agent to act randomly in each state
    2. To allow the agent to relive past experiences and break correlation between samples
    3. To optimize memory usage by deleting all old data instantly
    4. To remember only the most recent action

    Explanation: Experience replay stores past transitions and samples them at random during training, breaking the temporal correlations present in sequential experience and leading to more stable and effective learning. Remembering only the most recent action disregards valuable learning information. Acting randomly does not leverage learning from experience, and deleting old data immediately would prevent learning from diverse experiences.
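
    A minimal replay buffer sketch follows; the class name, capacity, and plain-tuple storage are illustrative assumptions, but uniform random sampling is the key idea.

    ```python
    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity memory; uniform sampling breaks temporal correlation."""

        def __init__(self, capacity: int = 10_000):
            self.memory = deque(maxlen=capacity)   # oldest entries age out gradually

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size: int):
            # Randomly drawn transitions are unlikely to be consecutive in time.
            return random.sample(self.memory, batch_size)

        def __len__(self):
            return len(self.memory)
    ```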

  3. Target Network Function

    What is the role of a target network in the DQN algorithm, especially when training on complex environments?

    1. To provide stable Q-value targets for training the main network
    2. To shuffle the action sequence randomly
    3. To absorb and delete redundant states
    4. To increase the learning rate dynamically

    Explanation: The target network in DQN is updated only periodically and is used to provide stable Q-value targets, which improves the stability of training. Shuffling action sequences is unrelated to network training. Adjusting the learning rate and managing redundant states are likewise not purposes of the target network.
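
    A sketch of the usual pattern is shown below, assuming PyTorch; the tiny network, the variable names, and the update interval of 1,000 steps are illustrative assumptions.

    ```python
    import copy
    import torch.nn as nn

    # q_net stands in for the online Q-network (a tiny illustrative model here).
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net = copy.deepcopy(q_net)      # same architecture, separate weights

    TARGET_UPDATE_EVERY = 1_000            # sync interval, in gradient steps

    def maybe_sync_target(step: int) -> None:
        # Between syncs the target's weights stay frozen, so the Q-value targets
        # it produces do not shift with every update of the online network.
        if step % TARGET_UPDATE_EVERY == 0:
            target_net.load_state_dict(q_net.state_dict())
    ```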

  4. Epsilon-Greedy Policy

    In Deep Q-Networks, why is the epsilon-greedy strategy used for selecting actions during training?

    1. To always pick the highest-value action only
    2. To select actions based on supervised labels
    3. To balance exploration of new actions and exploitation of learned actions
    4. To ignore the current state and act randomly

    Explanation: The epsilon-greedy policy encourages exploration with probability epsilon, allowing the agent to sometimes select random actions while mostly exploiting the best-known actions. Always picking the highest-value action would prevent the agent from exploring potentially better alternatives. Pure random selection ignores what has been learned, and using supervised labels is not relevant to reinforcement learning.
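
    A minimal sketch of the selection rule, assuming PyTorch; the function name and signature are illustrative.

    ```python
    import random
    import torch

    def epsilon_greedy_action(q_net, state: torch.Tensor,
                              epsilon: float, num_actions: int) -> int:
        # Explore: with probability epsilon, pick a uniformly random action.
        if random.random() < epsilon:
            return random.randrange(num_actions)
        # Exploit: otherwise pick the action with the highest estimated Q-value.
        with torch.no_grad():
            q_values = q_net(state.unsqueeze(0))
        return int(q_values.argmax(dim=1).item())
    ```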

  5. Deep Q-Network Output

    What does the output of a Deep Q-Network represent for a given input state vector?

    1. The agent’s position in the environment
    2. The sum of all previous rewards only
    3. Estimated Q-values for each possible action in that state
    4. Predicted class probabilities for labels

    Explanation: A DQN outputs estimated Q-values for every possible action the agent can take in a given state, which guide action selection. Predicting class probabilities is characteristic of classification tasks. Outputting the agent’s position or the sum of rewards does not align with the DQN’s functionality.
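
    The snippet below shows what that output looks like in practice, assuming PyTorch; the tiny network and the particular state values are illustrative.

    ```python
    import torch
    import torch.nn as nn

    # A tiny illustrative Q-network: 4 state features in, 2 actions out.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

    state = torch.tensor([[0.0, 0.1, 0.0, -0.1]])  # one state vector as input
    q_values = q_net(state)                        # shape (1, 2): one Q-value per action
    greedy_action = int(q_values.argmax(dim=1))    # index of the best-looking action
    ```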

  6. Bellman Equation Use

    How is the Bellman equation utilized in training Deep Q-Networks during learning episodes?

    1. To generate synthetic training data
    2. To store and retrieve neural network weights
    3. To calculate transition probabilities between environments
    4. To update the estimate of the state-action value based on current rewards and future predictions

    Explanation: The Bellman equation provides a recursive relationship to update Q-value estimates by considering immediate rewards and the estimated value of future states. Transition probabilities may be used in some reinforcement learning approaches but are not the primary focus. Neural network weights and data generation are not directly related to the Bellman equation’s use in DQNs.
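
    In practice the resulting target is y = r + γ · max over a' of Q(s', a'), with the future term dropped at terminal states. A sketch, assuming PyTorch tensors and a discount factor of 0.99 as illustrative choices:

    ```python
    import torch

    GAMMA = 0.99   # illustrative discount factor

    def td_targets(rewards, next_states, dones, target_net):
        # Bellman-style target: r + gamma * max_a' Q(s', a'), with the future
        # term zeroed at terminal states (done == 1), using the target network.
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
        return rewards + GAMMA * next_q * (1.0 - dones)
    ```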

  7. Overestimation Problem

    Which problem can occur when the same network is used to select and evaluate actions in a DQN, especially without a target network?

    1. Underfitting of reward signals
    2. Guaranteed optimal computation
    3. Excessive memory usage
    4. Overestimation of Q-values

    Explanation: Using the same network for selection and evaluation can lead to overoptimistic Q-value estimates due to maximization bias, especially in noisy environments. Underfitting refers to poor learning or model complexity, not specific to this mechanism. High memory usage or guaranteeing optimality are not direct consequences of action evaluation with the same network.
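
    The bias is easy to demonstrate numerically; the small NumPy experiment below is an illustrative sketch, not part of the DQN algorithm itself.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    true_q = np.zeros(4)                                        # every action truly worth 0
    noisy_q = true_q + rng.normal(scale=1.0, size=(10_000, 4))  # noisy estimates

    print(noisy_q.mean())               # near 0: each estimate is unbiased on its own
    print(noisy_q.max(axis=1).mean())   # roughly 1: taking the max is biased upward
    ```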

  8. Markov Decision Process (MDP)

    What essential property of the environment is usually assumed in Deep Q-Networks, making it suitable for MDPs?

    1. The next state depends only on the current state and action
    2. Every action always leads to the same state
    3. The agent relies on supervised training labels
    4. State transitions are completely unpredictable

    Explanation: Markov Decision Processes assume the Markov property, where the next state depends solely on the current state and action, fitting the DQN learning structure. Completely unpredictable transitions or fixed action results do not align with typical MDPs. Reliance on supervised labels is not relevant to reinforcement learning or DQNs.
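
    Stated formally (a standard formulation, written here in LaTeX for clarity):

    ```latex
    % Markov property: conditioning on the full history adds nothing beyond
    % the current state and action.
    P(s_{t+1} \mid s_t, a_t) \;=\; P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)
    ```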

  9. Experience Tuple Elements

    Which elements are typically stored in one entry of the experience replay memory in a DQN?

    1. Current observation, output label, batch number, learning rate
    2. All possible network weights for the episode
    3. Agent’s current score, future reward predictions only
    4. Current state, action taken, reward received, next state, done flag

    Explanation: Experience replay entries usually consist of the current state, action, received reward, resulting next state, and a done flag to indicate episode termination. Scores and future predictions are not part of individual experiences. Storing network weights or unrelated training parameters does not align with experience replay.
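
    A common way to represent one such entry is a named tuple; the field values below are purely illustrative.

    ```python
    from collections import namedtuple

    # Field names mirror the elements listed in the explanation above.
    Transition = namedtuple(
        "Transition", ["state", "action", "reward", "next_state", "done"]
    )

    entry = Transition(state=[0.0, 0.1], action=1, reward=1.0,
                       next_state=[0.1, 0.2], done=False)
    ```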

  10. DQN's Input Type

    In Deep Q-Networks applied to environments like maze navigation, what is usually provided as the input to the neural network?

    1. A representation of the current environment state
    2. Randomly generated noise vectors
    3. The reward from the previous episode
    4. The agent’s list of past actions only

    Explanation: DQNs require a suitable representation of the current environment state, such as a position in a maze or an image of a game screen, as input. Past action lists, random noise, or previous rewards do not provide the required context for the DQN to determine optimal actions.
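
    As an illustrative sketch, a maze position might be encoded as a one-hot vector before being fed to the network; the 5x5 grid size and the function name are assumptions.

    ```python
    import numpy as np

    def encode_maze_state(row: int, col: int, size: int = 5) -> np.ndarray:
        # One-hot encoding of the agent's cell in a size-by-size maze; this
        # vector is what the Q-network would receive as its input state.
        state = np.zeros(size * size, dtype=np.float32)
        state[row * size + col] = 1.0
        return state

    print(encode_maze_state(2, 3).shape)   # (25,): one entry per maze cell
    ```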