Explore the foundational concepts of Markov Decision Processes (MDPs) with this beginner-friendly quiz. Assess your understanding of states, actions, rewards, value functions, policies, and key properties of MDPs relevant to decision-making and reinforcement learning.
Which of the following is the primary purpose of a Markov Decision Process (MDP)?
Explanation: An MDP is mainly used to model sequential decision-making problems in which outcomes are partly random and partly under the control of a decision maker. Sorting and data encryption are unrelated to the objectives of MDPs, and random number generation is not the purpose of an MDP either, although randomness is modeled via the transition probabilities.
Which of the following is NOT a standard component of a Markov Decision Process?
Explanation: The standard components of an MDP include states, actions, rewards, transition probabilities, and a discount factor. A state-action value table, often called a Q-table, is used in algorithms like Q-learning but is not a formal part of the MDP definition. The reward function and transition probabilities are integral to the MDP, and the discount factor is commonly included.
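To make the components concrete, here is a minimal sketch of the MDP tuple (states, actions, transition probabilities, rewards, discount factor) written as plain Python data structures. The state and action names and all of the numbers below are invented purely for illustration.

```python
# A minimal sketch of an MDP's components as plain Python data; the
# names and numbers are invented for illustration only.

states = ["cool", "warm", "overheated"]   # S: the set of states
actions = ["slow", "fast"]                # A: the set of actions
gamma = 0.9                               # discount factor

# Transition probabilities P(s' | s, a): for each state-action pair,
# a distribution over possible next states.
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}

# Reward function R(s, a): immediate reward for taking action a in state s.
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("warm", "slow"): 1.0,
    ("warm", "fast"): -10.0,
}
```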
What does the 'state' represent in a Markov Decision Process?
Explanation: A state describes the current situation or configuration of the environment in which the agent finds itself. Actions are the possible choices the agent can make, not the state itself. Rewards are outcomes, and time periods are not what states refer to in MDPs.
What is the Markov property in the context of MDPs?
Explanation: The Markov property means that the next state depends only on the present state and action, not on the sequence of states and actions before it. The other options are incorrect because state visitation and reward sign are not requirements of the Markov property, and memory of past states longer than one step violates this property.
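One common way to state the Markov property formally is that the next-state distribution conditioned on the whole history equals the distribution conditioned on the current state and action alone:

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```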
In an MDP, what is the role of an action?
Explanation: Actions selected by the agent cause transitions between states according to the transition probabilities defined in the MDP. Actions do not define the set of possible rewards or control episode length. Nor do actions eliminate the Markov property; together with the current state, they are exactly what the next-state distribution is conditioned on.
What does the reward function specify in a Markov Decision Process?
Explanation: The reward function defines the immediate numerical feedback that an agent receives for performing an action in a state. It does not enumerate states or dictate actions. The probability of transitions between states is not part of the reward function but of the transition function.
Why are transition probabilities important in MDPs?
Explanation: Transition probabilities define how likely the agent is to end up in each possible next state after taking an action in the current state. They do not govern the agent's memory, the magnitude of rewards, or any record of past actions; those options have nothing to do with the core role of transition probabilities in an MDP.
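As a small illustration, the probabilities over next states for any fixed state-action pair sum to 1, and one step of the environment can be simulated by sampling from that distribution. The pair and its probabilities below are made up for the example.

```python
import random

# Toy next-state distribution P(s' | s, a) for a single state-action pair;
# the names and probabilities are invented for illustration.
next_state_probs = {"cool": 0.5, "warm": 0.5}

# A valid distribution over next states must sum to 1.
assert abs(sum(next_state_probs.values()) - 1.0) < 1e-9

# Simulating one step of the environment: sample the next state.
next_state = random.choices(
    population=list(next_state_probs.keys()),
    weights=list(next_state_probs.values()),
)[0]
print(next_state)
```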
In the context of an MDP, what is a 'policy'?
Explanation: A policy defines how the agent chooses actions based on the current state. The reward function is separate, and a policy is not a list of states or outcomes. Randomizing rewards or storing full outcome histories is outside the definition of a policy.
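For instance, a deterministic policy can be represented simply as a mapping from each state to the action chosen there; the state and action names here are again invented for the example.

```python
# A deterministic policy: a lookup table from state to action.
# The state and action names are invented for illustration.
policy = {"cool": "fast", "warm": "slow"}

def act(state):
    """Return the action the policy chooses in the given state."""
    return policy[state]

print(act("warm"))  # -> "slow"
```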
What is the purpose of the discount factor (gamma) in an MDP?
Explanation: The discount factor makes future rewards worth less than immediate rewards, affecting how much the agent values long-term gains. It does not affect the number of actions, transition probabilities, or specify the initial state.
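As a quick numerical illustration (with an invented reward sequence), the discounted return weights each successive reward by an additional factor of gamma:

```python
# Discounted return for a short, invented sequence of rewards.
gamma = 0.9
rewards = [1.0, 1.0, 10.0]   # rewards received at successive time steps

# G = r1 + gamma * r2 + gamma^2 * r3 + ...
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)     # 1.0 + 0.9*1.0 + 0.81*10.0 = 10.0 (up to rounding)
```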
In an MDP, what is the main difference between a finite-horizon and infinite-horizon setting?
Explanation: A finite-horizon MDP ends after a predetermined number of steps, while an infinite-horizon MDP could go on indefinitely. Both types use states, actions, and rewards, so the other options are incorrect.
What does the value function represent in the context of an MDP?
Explanation: A value function estimates how good it is to start in a certain state and follow a particular policy, in terms of expected reward. It doesn’t describe action probabilities, state counts, or action costs directly.
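In symbols, the state-value function under a policy π is usually written as the expected discounted return when starting from state s and following π thereafter:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right]
```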
Which statement best describes an optimal policy in an MDP?
Explanation: An optimal policy is designed to maximize the total expected reward over time. Minimizing action count, guaranteeing visits to all states, or randomizing solely for exploration are not the criteria that define an optimal policy.
What is the goal of the policy evaluation step in MDPs?
Explanation: Policy evaluation calculates the expected returns for each state if the agent follows a certain policy. Selecting initial states, changing rewards, or measuring transition speed are not the aims of policy evaluation.
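A minimal sketch of iterative policy evaluation on an invented two-state MDP is shown below; it repeatedly applies the Bellman expectation backup until the state values stop changing. All names and numbers are made up for illustration.

```python
# Iterative policy evaluation on a toy MDP; all names and numbers are
# invented for illustration.

states = ["cool", "warm"]
gamma = 0.9

# The fixed policy being evaluated: which action is taken in each state.
policy = {"cool": "fast", "warm": "slow"}

# Model restricted to the state-action pairs the policy actually uses.
P = {("cool", "fast"): {"cool": 0.5, "warm": 0.5},
     ("warm", "slow"): {"cool": 0.5, "warm": 0.5}}
R = {("cool", "fast"): 2.0, ("warm", "slow"): 1.0}

# Repeatedly apply the Bellman expectation backup until values settle.
V = {s: 0.0 for s in states}
for _ in range(10_000):
    new_V = {}
    for s in states:
        a = policy[s]
        new_V[s] = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
    delta = max(abs(new_V[s] - V[s]) for s in states)
    V = new_V
    if delta < 1e-8:
        break

print(V)  # expected discounted return from each state under this policy
```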
In an MDP, what typically signals the end of an episode?
Explanation: An episode ends when the agent enters a terminal state, after which no more actions are taken. Repeating the same action or accumulating negative rewards does not, by definition, end an episode, and "exceeding the discount factor" is not meaningful as a stopping condition, since the discount factor is a fixed constant rather than a threshold.
Which algorithm is commonly used to find the optimal policy in an MDP?
Explanation: Value iteration is a standard dynamic programming method for solving MDPs and finding optimal policies. Binary search, gradient boosting, and linear regression are unrelated to solving MDPs directly, as they belong to different areas in computer science and machine learning.
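Sketched below is a bare-bones value iteration loop on an invented toy MDP; it repeatedly applies the Bellman optimality backup and then reads off a greedy policy from the converged values. The model is made up for illustration, not taken from the quiz.

```python
# Value iteration on a toy MDP; names and numbers invented for illustration.

states = ["cool", "warm", "overheated"]
actions = ["slow", "fast"]
gamma = 0.9

# P[(s, a)] -> {s': probability}; "overheated" is terminal (no actions).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}
R = {("cool", "slow"): 1.0, ("cool", "fast"): 2.0,
     ("warm", "slow"): 1.0, ("warm", "fast"): -10.0}

# Repeatedly apply the Bellman optimality backup until values settle.
V = {s: 0.0 for s in states}
for _ in range(10_000):
    new_V = {"overheated": 0.0}
    for s in ["cool", "warm"]:
        new_V[s] = max(
            R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            for a in actions
        )
    delta = max(abs(new_V[s] - V[s]) for s in states)
    V = new_V
    if delta < 1e-8:
        break

# Extract a greedy (optimal) policy from the converged values.
policy = {
    s: max(actions, key=lambda a: R[(s, a)] +
           gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
    for s in ["cool", "warm"]
}
print(V, policy)
```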
What is the main objective when working with Markov Decision Processes?
Explanation: The central challenge in MDPs is to discover a policy that produces the highest possible expected cumulative or total reward. Minimizing the number of states, eliminating randomness, or equalizing rewards are not fundamental goals of MDPs.
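One common way to write this objective as a formula is to search for the policy that maximizes the expected discounted sum of rewards:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\right]
```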