Essential Apache Airflow Interview Questions Quiz

Dive into 15 carefully selected Apache Airflow interview questions that help data professionals and engineers review key orchestration concepts, terminology, and practical knowledge. The questions cover fundamentals from DAGs and operators to XComs, Airflow architecture, and troubleshooting strategies for modern workflow automation.

  1. Operators in Airflow

    Which statement best defines an operator in Apache Airflow, and which operator type is commonly used to execute Python code?

    1. Operators define the logic to run a specific unit of work, and PythonOperator is used for Python code.
    2. Operators refer to the people who manage the DAGs, and BashOperator is for Python code.
    3. Operators are only for scheduling tasks, and EmailOperator handles Python code.
    4. Operators are configuration files, and DummyOperator runs Python code.

    Explanation: Operators in Airflow encapsulate the logic to perform a specific activity, such as running Python code or executing bash commands. PythonOperator is specifically designed to execute Python functions. The other options misinterpret what operators are or mention incorrect operator types for Python code (e.g., BashOperator, EmailOperator, DummyOperator). Operators are not people or configuration files.
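
    As a quick illustration, here is a minimal sketch (assuming a recent Airflow 2.x release, 2.4 or later; the DAG id is hypothetical) of a PythonOperator running a small Python function:

      import pendulum
      from airflow import DAG
      from airflow.operators.python import PythonOperator

      def greet():
          # The unit of work this task performs.
          print("Hello from Airflow!")

      with DAG(
          dag_id="example_python_dag",    # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule=None,                  # run only when triggered manually
          catchup=False,
      ) as dag:
          greet_task = PythonOperator(
              task_id="greet",
              python_callable=greet,      # the Python code the operator executes
          )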

  2. DAG Concept and Importance

    What is a DAG in Airflow and why is it essential for workflow management?

    1. A DAG is a configuration file for logging errors in Airflow pipelines.
    2. A DAG is a scheduling tool for manual task execution without dependencies.
    3. A DAG describes tasks and their dependencies in a directed acyclic graph, allowing Airflow to manage workflow execution order.
    4. A DAG is a database table that stores user information used for authentication.

    Explanation: A DAG (Directed Acyclic Graph) is the core of Airflow's workflow management, representing tasks and their dependency structure. It ensures tasks run in a determined order without cycles. The other answers either refer to unrelated components (database tables, logging configuration) or misunderstand how DAGs handle scheduling and dependencies.
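
    A minimal sketch of a DAG (assuming Airflow 2.4+; the DAG and task ids are hypothetical) with two tasks, where 'extract' must finish before 'load' starts:

      import pendulum
      from airflow import DAG
      from airflow.operators.bash import BashOperator

      with DAG(
          dag_id="example_etl_dag",       # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule="@daily",
          catchup=False,
      ) as dag:
          extract = BashOperator(task_id="extract", bash_command="echo extracting")
          load = BashOperator(task_id="load", bash_command="echo loading")

          # The directed edge: 'extract' runs before 'load'; no cycles are allowed.
          extract >> load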

  3. Defining Tasks in DAGs

    How do you define a new task within a DAG in Airflow?

    1. By instantiating an operator and assigning it to a variable within the DAG context.
    2. By creating a new database entry in the Airflow metadata table.
    3. By writing a script directly in the Airflow web UI without referencing any operator.
    4. By modifying the settings.py file and restarting the Airflow server.

    Explanation: Defining a task involves instantiating an operator (such as PythonOperator or BashOperator) and assigning it to a variable, usually inside a DAG definition in Python code. The other options are not standard practice: tasks cannot be defined directly through the web UI, settings.py is not used for this, and metadata table entries should never be created manually.
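
    For example (a sketch assuming Airflow 2.4+; names are hypothetical), a task is simply an operator instance assigned to a variable inside the DAG context:

      import pendulum
      from airflow import DAG
      from airflow.operators.bash import BashOperator

      with DAG(
          dag_id="define_tasks_demo",     # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule=None,
          catchup=False,
      ) as dag:
          # Instantiating the operator inside the 'with DAG' block registers it as a task.
          say_hello = BashOperator(
              task_id="say_hello",
              bash_command="echo 'hello'",
          )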

  4. Task Instance vs. DAG Run

    What is the main difference between a task instance and a DAG run in Apache Airflow?

    1. A task instance is a template for tasks, while a DAG run is only used for debugging.
    2. A task instance is a single execution of a task, while a DAG run is a single execution of the entire workflow.
    3. A task instance tracks all DAG runs, whereas a DAG run monitors only the failed tasks.
    4. A task instance logs errors, and a DAG run creates new database tables.

    Explanation: A task instance represents one run of a specific task in a DAG for a particular schedule or invocation, while a DAG run represents one complete run of the entire DAG. The other responses either confuse the tracking system, misuse terminology, or describe unrelated behaviors.

  5. start_date Purpose

    What is the function of the 'start_date' parameter in an Airflow DAG definition?

    1. It tells Airflow when to begin scheduling DAG runs.
    2. It marks the deadline for all tasks to complete.
    3. It determines the database expiration date.
    4. It controls the maximum number of retries for each task.

    Explanation: The 'start_date' specifies the point in time from which Airflow should start creating DAG run instances; with a schedule, each run covers an interval after 'start_date' and executes once that interval ends. It does not set a deadline, is unrelated to database expiration dates, and does not influence retry configuration, which is handled by other parameters.
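
    A sketch of how 'start_date' is typically set (assuming Airflow 2.4+; the DAG id is hypothetical; a timezone-aware pendulum datetime is the commonly recommended form):

      import pendulum
      from airflow import DAG
      from airflow.operators.empty import EmptyOperator

      with DAG(
          dag_id="start_date_demo",       # hypothetical DAG id
          # No DAG runs are created for logical dates before this point; with a
          # schedule, each run executes at the end of its interval.
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule="@daily",
          catchup=False,
      ) as dag:
          EmptyOperator(task_id="noop")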

  6. Manual DAG Trigger

    How can a user manually trigger the execution of a DAG using the Airflow web interface?

    1. By writing a script in the Airflow UI's code editor.
    2. By restarting the Airflow worker service.
    3. By clicking the 'Trigger DAG' button next to the desired DAG in the interface.
    4. By creating a new database row in the metadata table.

    Explanation: The Airflow web UI provides a 'Trigger DAG' button that users can click to start a DAG immediately. Writing scripts, modifying database rows, or restarting services is not the correct or recommended way to manually initiate DAGs.

  7. XCom in Airflow

    What purpose does XCom serve in Airflow workflows?

    1. It enables tasks to exchange small amounts of data during execution.
    2. It logs system errors during DAG execution.
    3. It manages web server connection settings.
    4. It configures the number of task retries.

    Explanation: XCom (short for 'cross-communication') lets tasks share information, such as results or states, with other tasks. It is not intended for error logging, retry configuration, or managing server connections.
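
    A small sketch (assuming Airflow 2.4+; DAG and task ids are hypothetical) in which one task pushes a value via its return value and another pulls it with xcom_pull:

      import pendulum
      from airflow import DAG
      from airflow.operators.python import PythonOperator

      def push_value():
          # A returned value is automatically pushed to XCom under the key "return_value".
          return 42

      def pull_value(ti):
          # 'ti' (the TaskInstance) is injected from the task context.
          value = ti.xcom_pull(task_ids="push_task")
          print(f"Received {value} from push_task")

      with DAG(
          dag_id="xcom_demo",             # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule=None,
          catchup=False,
      ) as dag:
          push = PythonOperator(task_id="push_task", python_callable=push_value)
          pull = PythonOperator(task_id="pull_task", python_callable=pull_value)
          push >> pull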

  8. Protecting Database Passwords

    What is a recommended way to securely store passwords for database connections in Airflow?

    1. Write passwords directly in plain text inside DAG files.
    2. Include passwords in default_args parameters.
    3. Use Airflow's built-in connections with the passwords encrypted or stored in a secure backend.
    4. Store passwords in comments in the code.

    Explanation: Airflow supports secure storage of sensitive credentials through its connections feature, encrypting passwords in the metadata database with a Fernet key or delegating them to a secrets backend (for example, HashiCorp Vault or AWS Secrets Manager). Writing passwords in plain text, code comments, or the default_args parameter poses a security risk.
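
    For illustration, a task can look up credentials at runtime from an Airflow connection instead of hard-coding them. This sketch assumes Airflow 2.x and a hypothetical connection id 'my_postgres' defined in the UI, environment variables, or a secrets backend:

      from airflow.hooks.base import BaseHook

      def get_db_credentials():
          # The connection is resolved by Airflow at runtime;
          # the password never appears in the DAG code itself.
          conn = BaseHook.get_connection("my_postgres")    # hypothetical conn_id
          return conn.host, conn.login, conn.password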

  9. Configurable Triggers

    Which of the following is an event or trigger that can be configured to schedule DAG execution in Apache Airflow?

    1. RAM size of the worker node.
    2. Amount of disk used on the server.
    3. Number of open browser tabs.
    4. Time-based schedules using cron expressions.

    Explanation: Airflow DAGs can be scheduled on time-based intervals, commonly expressed as cron strings (e.g., '0 6 * * *') or presets such as '@daily'. The other options (RAM, browser tabs, or disk usage) are not scheduling triggers in Airflow.
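
    For example, a cron expression can schedule a DAG every day at 06:00 UTC (a sketch assuming Airflow 2.4+, where the parameter is named 'schedule'; older 2.x releases use 'schedule_interval'; the DAG id is hypothetical):

      import pendulum
      from airflow import DAG
      from airflow.operators.empty import EmptyOperator

      with DAG(
          dag_id="cron_schedule_demo",    # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule="0 6 * * *",           # every day at 06:00 UTC
          catchup=False,
      ) as dag:
          EmptyOperator(task_id="noop")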

  10. catchup Option Purpose

    What happens when the 'catchup' parameter is set to false in a DAG's configuration?

    1. Airflow only schedules new DAG runs from the current time forward, skipping past intervals.
    2. All historical DAG runs will execute immediately upon enabling the DAG.
    3. Airflow retries failed tasks more aggressively.
    4. The DAG skips all future task executions.

    Explanation: With 'catchup=False', Airflow does not backfill the missed intervals between 'start_date' and the present; it schedules only the most recent interval and future runs. The other choices either misinterpret the behavior or combine unrelated concepts.
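
    A sketch showing the parameter in context (assuming Airflow 2.4+; the DAG id is hypothetical):

      import pendulum
      from airflow import DAG
      from airflow.operators.empty import EmptyOperator

      with DAG(
          dag_id="catchup_demo",          # hypothetical DAG id
          start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
          schedule="@daily",
          # With catchup=False, the backlog of intervals between start_date and
          # now is skipped; only the most recent interval onward is scheduled.
          catchup=False,
      ) as dag:
          EmptyOperator(task_id="noop")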

  11. Airflow Scheduler Mechanism

    In simple terms, what is the main responsibility of the Airflow scheduler?

    1. It encrypts all user passwords automatically.
    2. It creates visualization charts for workflow progress.
    3. It decides when to run tasks by following the DAG schedule and triggers task execution accordingly.
    4. It stores connection information for databases.

    Explanation: The scheduler continually monitors DAGs and their schedules, determining when tasks should be started. It does not handle database connections, automated password encryption, or direct visualization features.

  12. Troubleshooting Queued Task Issues

    If a task is stuck in the 'queued' status for a long time, which action should you take first?

    1. Check the status of the worker processes to ensure they are running and able to pick up tasks.
    2. Immediately delete the DAG file from the system.
    3. Increase the DAG's schedule interval to force execution.
    4. Restart the Airflow web server only.

    Explanation: Worker processes are responsible for task execution. If tasks remain queued, workers may be unavailable or overloaded. Deleting the DAG, changing the schedule interval, or restarting only the web server will not resolve task execution issues.

  13. Broken DAG Error Causes

    Which of the following could be a likely cause of a 'Broken DAG' error in Airflow?

    1. Exceeding the number of XCom messages allowed.
    2. A web UI theme misconfiguration.
    3. A failed database connection during task execution.
    4. A syntax error or missing import in the DAG Python file.

    Explanation: A 'Broken DAG' error usually indicates that Airflow cannot parse the DAG file because of a Python error, such as bad syntax or a missing import. The other choices are unrelated; runtime failures and UI customization do not trigger 'Broken DAG' errors.
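
    For instance, a DAG file like the following sketch (intentionally broken; all names are hypothetical) would fail at parse time and surface a 'Broken DAG' banner in the UI:

      import pendulum
      from airflow import DAG
      # ImportError at parse time: no such module exists, so the whole file
      # fails to load and Airflow reports a 'Broken DAG' error.
      from airflow.operators.nonexistent import NonexistentOperator

      with DAG(
          dag_id="broken_dag_demo",       # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule=None,
          catchup=False,
      ) as dag:
          NonexistentOperator(task_id="never_parsed")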

  14. XCom vs Variables

    How does using XCom differ from Variables in Airflow?

    1. Both are used only for task retry logic.
    2. XCom is for saving files on disk, and Variables handle database backups.
    3. XComs are used only by the scheduler, and Variables are for workers only.
    4. XCom shares data between tasks in the same DAG run, while Variables are meant for storing global configuration values across DAGs and runs.

    Explanation: XCom allows data transfer within a DAG run between its tasks. Variables serve as key-value stores for configuration accessible across DAGs and runs. The other responses misrepresent their purposes: neither mechanism is used for disk storage, task retries, or backups, and neither is restricted to a particular Airflow component.
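
    As a contrast to the XCom example earlier, a Variable can be read the same way from any DAG or task (a sketch assuming Airflow 2.x; the variable key and default are hypothetical):

      from airflow.models import Variable

      def use_global_config():
          # A Variable is a global key-value setting shared across DAGs and runs,
          # unlike an XCom, which is tied to a specific DAG run.
          environment = Variable.get("deployment_env", default_var="dev")    # hypothetical key
          print(f"Running against the {environment} environment")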

  15. Retry Parameters

    Which parameter allows you to configure how many times a task should be retried upon failure in Airflow?

    1. web_refresh
    2. project_name
    3. max_concurrent_runs
    4. retries

    Explanation: 'retries' specifies the number of retry attempts for a failed task and is often paired with 'retry_delay'. The other options do not control retries: 'max_concurrent_runs' is not a standard Airflow setting (the DAG-level 'max_active_runs' parameter limits concurrent DAG runs), 'project_name' is unrelated, and 'web_refresh' does not exist in Airflow.
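
    A sketch of retry configuration via default_args (assuming Airflow 2.4+; the DAG id and values are illustrative):

      import pendulum
      from datetime import timedelta
      from airflow import DAG
      from airflow.operators.bash import BashOperator

      default_args = {
          "retries": 3,                         # retry a failed task up to 3 times
          "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
      }

      with DAG(
          dag_id="retries_demo",                # hypothetical DAG id
          start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
          schedule=None,
          catchup=False,
          default_args=default_args,            # applied to every task in the DAG
      ) as dag:
          # In real use this command might fail intermittently; retries would then apply.
          BashOperator(task_id="flaky_step", bash_command="exit 0")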