Essential SQL and Python Interview Questions Quiz

Test your knowledge of common SQL and Python interview questions, covering key concepts such as window functions, data cleaning, aggregation, and database manipulation. This quiz is ideal for candidates preparing for data analytics or data engineering roles where SQL and Python skills are required.

  1. Extracting Top N Records per Group

    Which SQL construct allows you to select the top 2 selling products in each category from a sales table, based on product revenue figures?

    1. ORDER BY revenue PARTITION category
    2. GROUP BY category LIMIT 2
    3. RANK() OVER (PARTITION BY category ORDER BY revenue DESC)
    4. WHERE revenue IN (SELECT TOP 2 revenue)

    Explanation: RANK() OVER (PARTITION BY category ORDER BY revenue DESC) assigns a rank within each category based on revenue, making it possible to filter for the top N records per group. 'GROUP BY category LIMIT 2' does not return two rows per group but only two groups in total. 'WHERE revenue IN (SELECT TOP 2 revenue)' incorrectly attempts to select the top two revenues globally, not per category. 'ORDER BY revenue PARTITION category' is not valid SQL syntax.
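
    As a quick illustration of the pattern, here is a minimal, runnable sketch using Python's built-in sqlite3 module; the sales table, its columns, and the sample rows are invented for demonstration, and the same query works in PostgreSQL and other dialects that support window functions.

        import sqlite3

        # In-memory database with a hypothetical sales(category, product, revenue) table.
        # Window functions require SQLite 3.25+ (bundled with recent Python releases).
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE sales (category TEXT, product TEXT, revenue REAL);
            INSERT INTO sales VALUES
                ('Toys', 'Blocks', 500), ('Toys', 'Puzzle', 300), ('Toys', 'Doll', 200),
                ('Books', 'Novel', 800), ('Books', 'Atlas', 400), ('Books', 'Comic', 100);
        """)

        # Rank products within each category, then keep the top 2 per category.
        query = """
            SELECT category, product, revenue
            FROM (
                SELECT category, product, revenue,
                       RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
                FROM sales
            ) AS ranked
            WHERE rnk <= 2;
        """
        for row in conn.execute(query):
            print(row)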

  2. Counting Null and Non-Null Values in pandas

    Which pair of pandas methods would you use to count the number of NULL and non-NULL values in each column of a DataFrame named df?

    1. nunique() and value_counts()
    2. dropna().count() and fillna().count()
    3. isna().mean() and notna().mean()
    4. isnull().sum() and notnull().sum()

    Explanation: The methods df.isnull().sum() and df.notnull().sum() return counts of null and non-null values per column, which is ideal for data quality checks. dropna().count() and fillna().count() do not count nulls directly (and fillna() requires a fill value). nunique() and value_counts() report unique counts and value frequencies, not nulls. isna().mean() and notna().mean() give proportions, not counts.
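
    A minimal sketch with an invented DataFrame shows the two counts side by side:

        import numpy as np
        import pandas as pd

        # Small DataFrame with a few deliberately missing values.
        df = pd.DataFrame({
            "email": ["a@x.com", None, "c@x.com"],
            "age": [25, np.nan, 40],
        })

        print(df.isnull().sum())   # NULLs per column: email 1, age 1
        print(df.notnull().sum())  # non-NULLs per column: email 2, age 2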

  3. Calculating Running Totals in SQL

    If you need to compute a running total of revenue by date in SQL, which function should you use in your SELECT statement?

    1. SUM(revenue) GROUP BY date
    2. AVG(revenue) OVER (ORDER BY date)
    3. COUNT(revenue) PARTITION BY date
    4. SUM(revenue) OVER (ORDER BY date)

    Explanation: SUM(revenue) OVER (ORDER BY date) calculates a cumulative running total ordered by date. COUNT(revenue) PARTITION BY date would count rows per date, which is different. SUM(revenue) GROUP BY date only returns totals per date, not running sums. AVG(revenue) OVER (ORDER BY date) gives running averages, not totals.
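
    The sketch below, using an invented daily_revenue table in an in-memory SQLite database, shows the running total growing row by row; the same window function works in most modern SQL dialects.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE daily_revenue (date TEXT, revenue REAL);
            INSERT INTO daily_revenue VALUES
                ('2024-01-01', 100), ('2024-01-02', 150), ('2024-01-03', 50);
        """)

        # Cumulative sum ordered by date.
        query = """
            SELECT date,
                   revenue,
                   SUM(revenue) OVER (ORDER BY date) AS running_total
            FROM daily_revenue;
        """
        for row in conn.execute(query):
            print(row)  # running_total: 100.0, 250.0, 300.0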

  4. Removing Duplicates from a CSV File

    To remove duplicate rows based on the 'email' column from a CSV file in pandas, which code would you use?

    1. df = df.delete_duplicates('email')
    2. df = df.drop_duplicates(subset='email')
    3. df = df.remove_duplicates(['email'])
    4. df = df.drop_duplication('email')

    Explanation: The correct method is df.drop_duplicates(subset='email'), which keeps only the first occurrence of each unique email. The options 'delete_duplicates', 'remove_duplicates', and 'drop_duplication' are not valid pandas methods and will result in errors.
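
    A minimal sketch, using an inline CSV string in place of a real file, shows the call in context; the column names and rows are invented.

        from io import StringIO

        import pandas as pd

        # Inline CSV standing in for a file on disk (pd.read_csv("file.csv") in practice).
        csv_data = StringIO("email,name\na@x.com,Ann\nb@x.com,Bob\na@x.com,Ann\n")
        df = pd.read_csv(csv_data)

        # Keep only the first row for each unique email.
        df = df.drop_duplicates(subset="email")
        print(df)  # two rows remain: a@x.com and b@x.com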

  5. Finding the Highest Value Customer

    Given a transaction table, how would you identify the customer with the greatest total spend using SQL?

    1. SELECT TOP 1 customer_id, SUM(order_total) FROM orders;
    2. SELECT customer_id FROM orders WHERE order_total = MAX(order_total);
    3. SELECT customer_id, SUM(order_total) AS total_spent FROM orders GROUP BY customer_id ORDER BY total_spent DESC LIMIT 1;
    4. SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(order_total);

    Explanation: Aggregating with SUM(order_total), grouping by customer_id, and ordering by total_spent in descending order before limiting to one row returns the customer with the highest spend. Option 1 lacks a GROUP BY and relies on TOP, which only some dialects support. Option 2 incorrectly compares individual orders against MAX(order_total) inside WHERE, which is invalid. Option 4 misuses HAVING without a comparison condition.
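
    Here is a minimal sketch against an invented orders(customer_id, order_total) table in an in-memory SQLite database; in dialects without LIMIT, the equivalent is FETCH FIRST 1 ROW ONLY or TOP 1.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE orders (customer_id INTEGER, order_total REAL);
            INSERT INTO orders VALUES (1, 50), (1, 75), (2, 200), (3, 30);
        """)

        query = """
            SELECT customer_id, SUM(order_total) AS total_spent
            FROM orders
            GROUP BY customer_id
            ORDER BY total_spent DESC
            LIMIT 1;
        """
        print(conn.execute(query).fetchone())  # (2, 200.0)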

  6. Detecting Revenue Anomalies in SQL

    Which SQL query pattern would best identify dates where revenue is more than three times the average revenue of the previous 7 days?

     1. HAVING SUM(revenue) > 3 * AVG(revenue)
     2. HAVING revenue > 3 * AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING)
     3. WHERE revenue = MAX(revenue) OVER (PARTITION BY date)
     4. WHERE revenue > 3 * AVG(revenue) GROUP BY date

    Explanation: Comparing each date's revenue with three times AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) flags significant surges; in practice the window average is computed in a subquery or CTE and then filtered, since window functions cannot be referenced directly in WHERE or HAVING. Option 1 compares aggregate totals, not a moving average. Option 3 only finds the maximum revenue per date, not surges. Option 4 misapplies GROUP BY and an aggregate inside WHERE.
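
    The sketch below builds a small, invented daily_revenue table in SQLite and wraps the window average in a subquery before filtering, which is how this pattern is usually written in practice.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE daily_revenue (date TEXT, revenue REAL)")
        # Nine flat days of revenue followed by one large spike.
        rows = [(f"2024-01-{d:02d}", 100.0) for d in range(1, 10)] + [("2024-01-10", 900.0)]
        conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", rows)

        query = """
            SELECT date, revenue
            FROM (
                SELECT date, revenue,
                       AVG(revenue) OVER (
                           ORDER BY date
                           ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
                       ) AS avg_prev_7
                FROM daily_revenue
            ) AS windowed
            WHERE revenue > 3 * avg_prev_7;
        """
        print(conn.execute(query).fetchall())  # [('2024-01-10', 900.0)]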

  7. Connecting and Fetching Data from PostgreSQL in Python

    Which Python library and method combination allows you to connect to a PostgreSQL database and fetch the results of a query?

    1. sqlite3 library with Cursor.read()
    2. psycopg2 library with connect() and cursor().fetchall()
    3. pyodbc library with connect() and fetch_rows()
    4. mysqlclient library with connection.query()

    Explanation: psycopg2 is the standard PostgreSQL adapter for Python, using connect() for connections and fetchall() to retrieve result sets. pyodbc is used for ODBC connections and does not have a fetch_rows() method. sqlite3 is not for PostgreSQL, and Cursor.read() does not exist. mysqlclient works with MySQL, not PostgreSQL.
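
    The sketch below shows the typical psycopg2 flow; the connection parameters, database name, and table are placeholders, so it only runs against a reachable PostgreSQL instance with matching credentials.

        import psycopg2

        # Placeholder credentials: substitute a real host, database, user, and password.
        conn = psycopg2.connect(
            host="localhost", dbname="analytics", user="analyst", password="secret"
        )
        cur = conn.cursor()
        cur.execute("SELECT customer_id, order_total FROM orders LIMIT 5;")
        rows = cur.fetchall()  # list of tuples, one per returned row
        print(rows)
        cur.close()
        conn.close()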

  8. Finding Maximum Order Per User in SQL

    To find the maximum order value made by each user in an orders table, which SQL pattern would you use?

     1. SELECT user_id FROM orders GROUP BY user_id HAVING MAX(order_total) > 100;
    2. SELECT user_id, MAX(order_total) AS max_order FROM orders GROUP BY user_id;
    3. SELECT user_id, order_total FROM orders WHERE order_total = MAX(order_total);
    4. SELECT MAX(order_total) FROM orders;

    Explanation: Grouping by user_id and applying MAX gives the largest order per user. Option 4 only returns the global maximum across all users. Option 3's use of MAX inside WHERE is invalid SQL syntax. Option 1 uses HAVING to filter users whose maximum order exceeds 100, which answers a different question and discards the order values themselves.
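
    A compact sketch against an invented orders(user_id, order_total) table confirms the per-user grouping; the pandas equivalent would be df.groupby("user_id")["order_total"].max().

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE orders (user_id INTEGER, order_total REAL);
            INSERT INTO orders VALUES (1, 40), (1, 90), (2, 120), (2, 60);
        """)

        query = """
            SELECT user_id, MAX(order_total) AS max_order
            FROM orders
            GROUP BY user_id;
        """
        print(conn.execute(query).fetchall())  # [(1, 90.0), (2, 120.0)]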

  9. Converting SQL Query Results to a DataFrame

    After running a SQL SELECT query on a table, how do you convert the results into a pandas DataFrame for analysis?

    1. df = pd.read_sql(query, conn)
    2. df = pd.read_query(query, connection)
    3. df = DataFrame(sql_results, as_frame=True)
    4. df = pd.read_csv(query, conn)

    Explanation: pd.read_sql(query, conn) reads the results of a SQL query directly into a pandas DataFrame. read_csv() expects a file or CSV text, not a database query. DataFrame(sql_results, as_frame=True) is not a valid constructor for this purpose. pd.read_query does not exist in pandas.
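
    In the sketch below, an in-memory SQLite connection stands in for whatever DB-API connection or SQLAlchemy engine you actually use; the users table and rows are invented.

        import sqlite3

        import pandas as pd

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

        # Run the query and load the result set straight into a DataFrame.
        df = pd.read_sql("SELECT * FROM users", conn)
        print(df)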

  10. Merging Two Tables and Filtering Results

    Which SQL query lets you join the 'orders' and 'users' tables on user ID and return only completed orders along with user names?

    1. SELECT JOIN orders and users ON user_id WHERE status = 'Completed';
    2. SELECT * FROM orders, users WHERE orders.user_id = users.id AND status = 'Completed';
    3. SELECT o.order_id u.name orders JOIN users WHERE order_status='Completed'
    4. SELECT o.order_id, u.name FROM orders o JOIN users u ON o.user_id = u.id WHERE o.status = 'Completed';

    Explanation: The correct approach uses an explicit JOIN with an ON condition, selects fields from each table, and filters by status. Option 2 relies on an implicit comma join, which works but is less readable and more error-prone. Options 1 and 3 are incomplete or invalid SQL statements with syntax errors.
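
    A final sketch wires the join together against invented orders and users tables in an in-memory SQLite database.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE users (id INTEGER, name TEXT);
            CREATE TABLE orders (order_id INTEGER, user_id INTEGER, status TEXT);
            INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
            INSERT INTO orders VALUES (10, 1, 'Completed'), (11, 2, 'Pending'), (12, 2, 'Completed');
        """)

        query = """
            SELECT o.order_id, u.name
            FROM orders o
            JOIN users u ON o.user_id = u.id
            WHERE o.status = 'Completed';
        """
        print(conn.execute(query).fetchall())  # [(10, 'Ann'), (12, 'Bob')]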