Essentials of Markov Decision Processes Quiz

Explore the foundational concepts of Markov Decision Processes (MDPs) with this beginner-friendly quiz. Assess your understanding of states, actions, rewards, value functions, policies, and key properties of MDPs relevant to decision-making and reinforcement learning.

  1. Definition of an MDP

    Which of the following is the primary purpose of a Markov Decision Process (MDP)?

    1. To model sequential decision-making problems under uncertainty
    2. To encrypt and decrypt messages
    3. To sort large amounts of data quickly
    4. To create random numbers for simulations

    Explanation: An MDP is mainly used to model problems where decisions must be made in a sequence and the outcomes are partly random and partly under the control of a decision maker. Neither sorting nor data encryption is related to the objectives of MDPs. Random number generation is not the key purpose of MDPs either, although randomness is modeled via transitions.

  2. MDP Components

    Which of the following is NOT a standard component of a Markov Decision Process?

    1. Discount factor
    2. Reward function
    3. State-action value table
    4. Transition probability

    Explanation: The standard components of an MDP include states, actions, rewards, transition probabilities, and a discount factor. A state-action value table, often called a Q-table, is used in algorithms like Q-learning but is not a formal part of the MDP definition. The reward function and transition probabilities are integral to the MDP, and the discount factor is commonly included.
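    For concreteness, here is a minimal sketch of the five standard components written out in Python. The two-state layout, the state and action names, and all of the numbers are invented purely for illustration.

    ```python
    from typing import Dict, Tuple

    # A toy MDP spelled out as its standard components (all values are made up).
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    # Transition probabilities P(s' | s, a): for each state-action pair,
    # a distribution over next states.
    P: Dict[Tuple[str, str], Dict[str, float]] = {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.5, "s1": 0.5},
    }

    # Reward function R(s, a): immediate feedback for taking action a in state s.
    R: Dict[Tuple[str, str], float] = {
        ("s0", "stay"): 0.0,
        ("s0", "move"): 1.0,
        ("s1", "stay"): 2.0,
        ("s1", "move"): 0.0,
    }

    gamma = 0.95  # discount factor
    ```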

  3. States in MDPs

    What does the 'state' represent in a Markov Decision Process?

    1. A time period between decisions
    2. A reward value received for an action
    3. A type of action the agent can perform
    4. A possible situation the agent can be in

    Explanation: A state describes the current situation or configuration of the environment in which the agent finds itself. Actions are the possible choices the agent can make, not the state itself. Rewards are outcomes, and time periods are not what states refer to in MDPs.

  4. The Markov Property

    What is the Markov property in the context of MDPs?

    1. States are determined by the past three actions
    2. Future states depend only on the current state and action
    3. All states must be visited equally often
    4. Rewards are always positive values

    Explanation: The Markov property means that the next state depends only on the present state and action, not on the sequence of states and actions before it. The other options are incorrect because state visitation and reward sign are not requirements of the Markov property, and memory of past states longer than one step violates this property.
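    Written as a formula (using standard notation for states and actions, not notation introduced elsewhere in this quiz), the property reads:

    ```latex
    P(S_{t+1} = s' \mid S_t = s, A_t = a, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)
        = P(S_{t+1} = s' \mid S_t = s, A_t = a)
    ```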

  5. Actions in MDPs

    In an MDP, what is the role of an action?

    1. To define the set of possible rewards
    2. To influence the transition from one state to another
    3. To eliminate the Markov property
    4. To determine the length of the episode

    Explanation: Actions selected by the agent cause transitions between states according to the probabilities defined in the MDP. Actions do not set the possible rewards directly or control episode length. Eliminating the Markov property is not something actions can do; the property is a statement about how transitions depend on the current state and the chosen action.

  6. Reward Functions

    What does the reward function specify in a Markov Decision Process?

    1. The immediate feedback received after taking an action
    2. The next action the agent must take
    3. The total number of possible states
    4. The probability of remaining in the same state

    Explanation: The reward function defines the immediate numerical feedback that an agent receives for performing an action in a state. It does not enumerate states or dictate actions. The probability of transitions between states is not part of the reward function but of the transition function.

  7. Transition Probabilities

    Why are transition probabilities important in MDPs?

    1. They set the magnitude of rewards
    2. They determine the agent's memory capacity
    3. They track the agent’s action history
    4. They specify the likelihood of moving from one state to another given an action

    Explanation: Transition probabilities define how likely it is to end up in a new state after taking an action in the current state. They do not impact memory, reward magnitude, or action history directly. These other options do not relate to the core role of transition probabilities in an MDP.
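    A minimal sketch of how transition probabilities are used, reusing the invented dictionary layout from the toy MDP above (illustrative only, not a definitive implementation):

    ```python
    import random
    from typing import Dict, Tuple

    # P[(s, a)] maps each next state to its probability, as in the toy MDP above.
    P: Dict[Tuple[str, str], Dict[str, float]] = {
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    }

    def step(state: str, action: str) -> str:
        """Sample the next state from P(s' | s, a)."""
        next_states = list(P[(state, action)])
        weights = list(P[(state, action)].values())
        return random.choices(next_states, weights=weights, k=1)[0]

    print(step("s0", "move"))  # "s1" roughly 80% of the time
    ```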

  8. Policies in MDPs

    In the context of an MDP, what is a 'policy'?

    1. A record of all previous outcomes
    2. A mapping from states to actions
    3. A sequence of all possible states
    4. A function that randomizes rewards

    Explanation: A policy defines how the agent chooses actions based on the current state. The reward function is separate, and a policy is not a list of states or outcomes. Randomizing rewards or storing full outcome histories is outside the definition of a policy.
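    As a sketch, a deterministic policy for the toy MDP above is just a lookup from state to action (a stochastic policy would instead map each state to a distribution over actions):

    ```python
    # Deterministic policy: exactly one action per state (names are from the toy example).
    policy = {
        "s0": "move",
        "s1": "stay",
    }

    current_state = "s0"
    chosen_action = policy[current_state]  # the policy dictates the action in s0
    ```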

  9. Discount Factor

    What is the purpose of the discount factor (gamma) in an MDP?

    1. It fixes the starting state
    2. It determines the present value of future rewards
    3. It sets the maximum number of actions
    4. It scales the transition probabilities

    Explanation: The discount factor makes future rewards worth less than immediate rewards, which controls how much the agent values long-term gains. It does not limit the number of actions, scale the transition probabilities, or specify the initial state.
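    A quick illustration with an invented reward sequence shows the effect of discounting:

    ```python
    # Four rewards of 1.0 each; with gamma < 1 the later ones contribute less.
    rewards = [1.0, 1.0, 1.0, 1.0]
    gamma = 0.9

    # Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*r_3
    G = sum(gamma**t * r for t, r in enumerate(rewards))
    print(G)  # 3.439, compared with 4.0 when gamma = 1
    ```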

  10. Finite vs. Infinite Horizons

    In an MDP, what is the main difference between a finite-horizon and infinite-horizon setting?

    1. Infinite-horizon MDPs have no rewards
    2. Finite-horizon MDPs do not use actions
    3. A finite-horizon MDP terminates after a fixed number of steps
    4. Infinite-horizon MDPs do not use states

    Explanation: A finite-horizon MDP ends after a predetermined number of steps, while an infinite-horizon MDP could go on indefinitely. Both types use states, actions, and rewards, so the other options are incorrect.

  11. Value Functions

    What does the value function represent in the context of an MDP?

    1. The cost of performing each action
    2. The maximum number of states in the process
    3. The probability of an action being chosen
    4. The expected total reward from a given state following a policy

    Explanation: A value function estimates how good it is to start in a certain state and follow a particular policy, in terms of expected reward. It doesn’t describe action probabilities, state counts, or action costs directly.
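    Formally, the state-value function of a policy π is the expected discounted return when starting in state s and following π thereafter:

    ```latex
    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s \right]
    ```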

  12. Optimal Policy

    Which statement best describes an optimal policy in an MDP?

    1. It randomizes actions to maximize exploration
    2. It guarantees visiting every state exactly once
    3. It maximizes the expected cumulative reward for the agent
    4. It always minimizes the number of actions taken

    Explanation: An optimal policy is designed to maximize the total expected reward over time. Minimizing action count, guaranteeing visits to all states, or randomizing solely for exploration are not the criteria that define an optimal policy.

  13. Policy Evaluation

    What is the goal of the policy evaluation step in MDPs?

    1. To measure the speed of transitions
    2. To design new reward functions
    3. To select the initial state randomly
    4. To compute the value of each state under a fixed policy

    Explanation: Policy evaluation calculates the expected returns for each state if the agent follows a certain policy. Selecting initial states, changing rewards, or measuring transition speed are not the aims of policy evaluation.
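    A minimal sketch of iterative policy evaluation, assuming the `states`, `policy`, `P`, `R`, and `gamma` objects from the toy example above (not a definitive implementation):

    ```python
    def evaluate_policy(states, policy, P, R, gamma, tol=1e-8):
        """Compute V(s) for every state under a fixed policy by repeatedly
        applying the Bellman expectation backup:
            V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')
        """
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V
    ```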

  14. Episode Termination

    In an MDP, what typically signals the end of an episode?

    1. Reaching a terminal state
    2. Choosing the same action twice in a row
    3. Accumulating negative rewards
    4. Exceeding the discount factor

    Explanation: An episode ends when the agent enters a terminal state, after which no more actions are taken. Taking the same action or accumulating negative rewards does not end an episode by definition, nor is it possible to exceed the discount factor (which is a constant).

  15. Solving MDPs

    Which algorithm is commonly used to find the optimal policy in an MDP?

    1. Binary Search
    2. Value Iteration
    3. Linear Regression
    4. Gradient Boosting

    Explanation: Value iteration is a standard dynamic programming method for solving MDPs and finding optimal policies. Binary search, gradient boosting, and linear regression are unrelated to solving MDPs directly, as they belong to different areas in computer science and machine learning.
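    A compact sketch of value iteration over the same toy MDP components (states, actions, P, R, and gamma as defined above; illustrative only):

    ```python
    def value_iteration(states, actions, P, R, gamma, tol=1e-8):
        """Apply the Bellman optimality backup until the values converge,
        then read off a greedy policy with respect to the converged values."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(
                    R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Greedy policy extraction: pick the action with the highest backed-up value.
        policy = {}
        for s in states:
            policy[s] = max(
                actions,
                key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
            )
        return V, policy
    ```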

  16. Goal of MDPs

    What is the main objective when working with Markov Decision Processes?

    1. To ensure every state returns the same reward
    2. To minimize the number of possible states
    3. To find a policy that yields the maximum expected cumulative reward
    4. To completely eliminate randomness from transitions

    Explanation: The central challenge in MDPs is to discover a policy that produces the highest possible expected cumulative or total reward. Minimizing the number of states, eliminating randomness, or equalizing rewards are not fundamental goals of MDPs.