Dive into 15 carefully selected Apache Airflow interview questions designed to help data professionals and engineers review key orchestration concepts, terminology, and practical knowledge. Master the fundamentals, from DAGs and operators to XComs, Airflow architecture, and practical troubleshooting strategies for modern workflow automation.
Which statement best defines an operator in Apache Airflow, and which operator type is commonly used to execute Python code?
Explanation: Operators in Airflow encapsulate the logic for performing a specific activity, such as running Python code or executing bash commands. The PythonOperator is designed specifically to execute a Python callable. The other options either misinterpret what operators are or name operator types that are not meant for Python code (e.g., BashOperator, EmailOperator, DummyOperator). Operators are not people or configuration files.
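As a quick sketch (the DAG id and function name here are illustrative, not prescribed), a PythonOperator simply wraps a plain Python callable:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from Airflow!")

with DAG(
    dag_id="hello_dag",            # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,        # run only when triggered manually
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,  # the function the operator executes
    )
```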
What is a DAG in Airflow and why is it essential for workflow management?
Explanation: A DAG (Directed Acyclic Graph) is the core abstraction of Airflow's workflow management, representing tasks and their dependency structure. It ensures tasks run in a defined order without cycles. The other answers either refer to unrelated components, like database tables or logging configs, or misunderstand the role DAGs play in scheduling dependencies.
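To make the "directed, acyclic" part concrete, here is a minimal sketch (task ids are illustrative) in which dependencies flow one way and can never loop back:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator in older Airflow versions

with DAG(
    dag_id="etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Directed edges: extract runs first, then transform, then load.
    # Airflow rejects any definition that would create a cycle.
    extract >> transform >> load
```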
How do you define a new task within a DAG in Airflow?
Explanation: Defining a task means instantiating an operator (such as PythonOperator or BashOperator) and assigning it to a variable, usually inside a DAG definition in Python code. The other options are not standard practice: tasks cannot be defined directly through the UI, settings.py is not used for this, and metadata table entries should never be created manually.
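Beyond instantiating operators directly, newer Airflow versions also offer the TaskFlow API. This sketch (names illustrative) defines an equivalent task with the @task decorator:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_sketch():
    @task
    def fetch_data():
        return {"rows": 42}  # return values are passed downstream via XCom automatically

    fetch_data()

taskflow_sketch()  # calling the decorated function registers the DAG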
What is the main difference between a task instance and a DAG run in Apache Airflow?
Explanation: A task instance represents one run of a specific task in a DAG for a particular schedule or invocation, while a DAG run represents one complete run of the entire DAG. The other responses either confuse the tracking system, misuse terminology, or describe unrelated behaviors.
What is the function of the 'start_date' parameter in an Airflow DAG definition?
Explanation: The 'start_date' specifies the point in time from which Airflow begins creating DAG runs. Note that the first run is actually triggered once the first schedule interval after 'start_date' has elapsed. The parameter does not set a deadline, is unrelated to database expiration dates, and does not influence retry behavior, which is handled by other parameters.
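A sketch illustrating that timing (the dates are arbitrary): with the settings below, the first run covers the 2023-01-01 interval but executes only once that interval closes:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="start_date_sketch",
    start_date=datetime(2023, 1, 1),  # earliest data interval Airflow will consider
    schedule_interval="@daily",
) as dag:
    # The run for 2023-01-01 is triggered at midnight on 2023-01-02,
    # i.e., once that day's interval has fully elapsed.
    placeholder = EmptyOperator(task_id="placeholder")
```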
How can a user manually trigger the execution of a DAG using the Airflow web interface?
Explanation: The Airflow web UI provides a 'Trigger DAG' button that users can click to start a DAG run immediately. Writing ad-hoc scripts, modifying database rows, or restarting services are not the correct or recommended ways to manually initiate a DAG.
What purpose does XCom serve in Airflow workflows?
Explanation: XCom (short for 'cross-communication') lets tasks share small pieces of information, such as results or states, with other tasks. It is not intended for error logging, retry configuration, or managing server connections.
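A minimal sketch of the pattern (task ids and the key name are illustrative): one task pushes a value and a downstream task pulls it:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value(ti):
    # Explicitly push a value; plain return values are pushed under the key 'return_value'.
    ti.xcom_push(key="row_count", value=128)

def pull_value(ti):
    count = ti.xcom_pull(task_ids="push_task", key="row_count")
    print(f"Upstream reported {count} rows")

with DAG(dag_id="xcom_sketch", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    push = PythonOperator(task_id="push_task", python_callable=push_value)
    pull = PythonOperator(task_id="pull_task", python_callable=pull_value)
    push >> pull
```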
What is a recommended way to securely store passwords for database connections in Airflow?
Explanation: Airflow supports secure storage of sensitive credentials through its Connections feature and secrets backend integrations (e.g., HashiCorp Vault or AWS Secrets Manager). Writing passwords in plain text, in code comments, or in the default_args parameter poses a security risk.
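Assuming a connection with the id my_db has already been created in the Airflow UI or via the CLI (the id is illustrative), code can then retrieve its credentials without hardcoding anything:

```python
from airflow.hooks.base import BaseHook

# Credentials live in Airflow's metadata DB or a configured secrets backend,
# never in the DAG file itself.
conn = BaseHook.get_connection("my_db")  # illustrative connection id
print(conn.host, conn.login)  # conn.password is available to hooks but should not be logged
```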
Which of the following is an event or trigger that can be configured to schedule DAG execution in Apache Airflow?
Explanation: Airflow DAGs can be scheduled using time-based intervals, typically expressed as cron syntax, preset aliases like '@daily', or timedelta objects. The other options (RAM, browser tabs, or disk usage) are not standard triggers for scheduling DAGs in Airflow.
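For illustration, these are three equivalent ways of expressing a time-based schedule:

```python
from datetime import timedelta

# All of these are valid values for a DAG's schedule_interval
# (the parameter is named `schedule` in Airflow 2.4+):
schedule_interval = "0 6 * * *"         # cron: every day at 06:00
schedule_interval = "@daily"            # preset alias for midnight daily
schedule_interval = timedelta(hours=6)  # fixed frequency: every 6 hours
```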
What happens when the 'catchup' parameter is set to false in a DAG's configuration?
Explanation: With 'catchup=False', Airflow skips backfilling the missed intervals between 'start_date' and the present, scheduling runs only from the most recent interval onward. The other choices either misinterpret the behavior or combine unrelated concepts.
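A short sketch (dates illustrative): even though start_date is far in the past, catchup=False prevents a backlog of historical runs:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="no_backfill_sketch",
    start_date=datetime(2020, 1, 1),  # far in the past
    schedule_interval="@daily",
    catchup=False,                    # skip the missed 2020-to-present intervals
) as dag:
    daily_job = EmptyOperator(task_id="daily_job")
```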
In simple terms, what is the main responsibility of the Airflow scheduler?
Explanation: The scheduler continually monitors DAGs and their schedules, determining when tasks should be started. It does not handle database connections, automated password encryption, or direct visualization features.
If a task is stuck in the 'queued' status for a long time, which action should you take first?
Explanation: Worker processes are responsible for executing tasks. If tasks remain queued, workers may be unavailable or overloaded, so checking worker health and capacity is the right first step. Deleting the DAG, changing the schedule interval, or merely restarting the web server won't resolve issues with task execution.
Which of the following could be a likely cause of a 'Broken DAG' error in Airflow?
Explanation: A 'Broken DAG' error indicates that Airflow cannot parse the DAG file, usually due to Python errors such as bad syntax or missing imports. The other choices are unrelated; runtime errors or UI customization do not trigger 'Broken DAG' errors.
How does using XCom differ from Variables in Airflow?
Explanation: XCom transfers data between tasks within a single DAG run, while Variables serve as key-value stores for configuration accessible across DAGs and runs. The other responses misrepresent their purposes: neither mechanism is meant for disk storage, task retries, or backups, and neither is restricted to particular Airflow components.
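A sketch of Variable usage for contrast (the keys are illustrative): unlike an XCom, the value is global configuration, not tied to any particular DAG run:

```python
from airflow.models import Variable

# Read a global key-value setting, with a fallback if it is unset.
env_name = Variable.get("environment", default_var="dev")

# Variables can hold JSON; deserialize_json parses it into a dict.
settings = Variable.get("etl_settings", deserialize_json=True, default_var={})
```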
Which parameter allows you to configure how many times a task should be retried upon failure in Airflow?
Explanation: 'retries' specifies the number of retry attempts for a failed task. The other options do not control retries: 'max_concurrent_runs' is not a standard Airflow parameter (run concurrency is governed by 'max_active_runs' at the DAG level), 'project_name' is unrelated, and 'web_refresh' does not exist in standard Airflow settings.
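A sketch showing retries set in default_args (values illustrative), so every task in the DAG inherits them:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                         # re-attempt a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}

with DAG(
    dag_id="retry_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    flaky = BashOperator(task_id="flaky_call", bash_command="curl https://example.com")
```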