SQL Joins and Aggregation Essentials for Feature Engineering Quiz

Test your foundational knowledge of SQL joins, group-by aggregations, and per-entity statistics to enhance feature engineering in relational data. This quiz covers key concepts like join types, aggregate functions, and techniques for generating features such as entity counts and group-wise summaries.

  1. Identifying the Correct Join for Feature Enrichment

    Which SQL join includes all rows from the left table and only the matching rows from the right table when enriching customer records with their purchase data?

    1. LEFT JOIN
    2. CROSS JOIN
    3. INNER JOIN
    4. RIGHT JOIN

    Explanation: A LEFT JOIN returns all records from the left table (including those without matches in the right table), which is useful when you want every customer even if some have no purchases. An INNER JOIN only gives records with matching rows in both tables, so customers without purchases would be excluded. RIGHT JOIN returns all records from the right table regardless of matches in the left, which would not fit this scenario. CROSS JOIN creates all possible combinations, which is generally not used for feature enrichment.

  2. Selecting the Basic Aggregate Function for Counting

    Which SQL aggregate function is best used to count the number of orders each user has placed?

    1. MIN
    2. COUNT
    3. SUM
    4. AVG

    Explanation: COUNT is used to count the number of non-null rows or values, making it ideal for counting orders per user. SUM adds numeric values, not the number of records. AVG calculates the average, not the count, and MIN returns the minimum value, not the total number of occurrences.

  3. Applying Group-By for Per-Entity Summaries

    What SQL clause is used to group purchase data by user_id so you can calculate total spending per user?

    1. WHERE
    2. GROUP BY
    3. HAVING
    4. ORDER BY

    Explanation: GROUP BY groups rows by specified columns such as user_id, allowing you to apply aggregate functions like SUM for feature creation. ORDER BY only sorts results and does not aggregate. WHERE filters rows before applying aggregations, and HAVING filters groups after aggregation.

  4. Understanding INNER JOIN Behavior

    What happens with an INNER JOIN between a users table and a transactions table if a user has no transactions?

    1. The user will appear with NULL in transaction columns.
    2. The user will be excluded from the result set.
    3. An error will occur in the query.
    4. A row will show with zeros for missing transaction values.

    Explanation: INNER JOIN only returns rows where there are matches in both tables, so users without transactions are omitted. LEFT JOIN would include the user with NULL transaction values. It will not automatically display zeros for missing transactions unless handled in the query. No error occurs; it's valid to have mismatched rows.

  5. Choosing a Function for Average Calculation

    To calculate the average transaction amount per customer, which SQL function should be used with GROUP BY?

    1. CONCAT
    2. MIN
    3. JOIN
    4. AVG

    Explanation: AVG returns the average of numeric values, which is exactly what is needed for average transaction amount. MIN returns the lowest value, not the average. CONCAT combines text values, not appropriate here. JOIN is not a function, but a keyword for linking tables.

  6. Identifying Join Type for Selecting All Matching Rows

    Which join type combines two tables by showing only rows where there are matching keys in both, such as matching students to their classes?

    1. CROSS JOIN
    2. FULL JOIN
    3. LEFT JOIN
    4. INNER JOIN

    Explanation: INNER JOIN only returns rows with matching keys in both tables, which ensures only students enrolled in classes appear. FULL JOIN includes all rows from both tables regardless of matches. LEFT JOIN includes all from the left table. CROSS JOIN creates all possible combinations, not filtered by keys.

  7. Aggregating Multiple Numeric Columns Per Group

    If you want to get both total and average sales per store in a sales table, which aggregate functions can you combine with GROUP BY?

    1. SUM and AVG
    2. AVG and JOIN
    3. COUNT and CONCAT
    4. MIN and MAX

    Explanation: SUM calculates total sales, while AVG computes the average, making them suitable to combine with GROUP BY for these features. COUNT tells you the number of sales, not the totals or averages. MIN and MAX show the extremes, not the sums or averages. JOIN is not an aggregate function.

  8. Filtering Groups Based on Aggregated Values

    Which SQL clause is used to filter groups after aggregation, such as returning only users with more than three logins?

    1. WHERE
    2. HAVING
    3. GROUP
    4. ORDER BY

    Explanation: HAVING lets you filter based on the result of aggregate functions, like 'COUNT(*) u003E 3'. WHERE cannot be used with aggregated results. ORDER BY sorts the results, and GROUP is not a valid clause on its own.

  9. Correct Syntax for Counting Distinct Values

    To count the number of unique items each user purchased, which SQL expression is appropriate?

    1. COUNT(DISTINCT item_id)
    2. DISTINCT COUNT(item_id)
    3. SUM(UNIQUE item_id)
    4. COUNT(ALL item_id)

    Explanation: COUNT(DISTINCT item_id) counts the number of unique item_ids for each user. SUM(UNIQUE ...) is invalid because SUM adds values and UNIQUE is not an SQL keyword. COUNT(ALL ...) is redundant as COUNT by itself counts all non-null values, not unique ones. DISTINCT COUNT(...) is incorrect syntax.

  10. Using Aggregation to Create Entity Features

    Which query would produce a feature showing the highest score achieved by each player from a games table?

    1. SELECT score, MIN(player_id) FROM games GROUP BY score;
    2. SELECT player_id, MAX(score) FROM games GROUP BY player_id;
    3. SELECT MAX(player_id) FROM games GROUP BY score;
    4. SELECT player_id, SUM(score) FROM games GROUP BY player_id;

    Explanation: Using MAX(score) with GROUP BY player_id delivers the highest score for each player. SUM(score) gives the total, not the maximum. The other queries group by score, which is not useful for per-player stats, and have incorrect aggregate logic.

  11. Non-Matching Rows in RIGHT JOIN

    What values will appear in the columns from the left table if there is no match in the right table when using a RIGHT JOIN?

    1. Empty string
    2. NULL
    3. 0
    4. Duplicate key

    Explanation: RIGHT JOIN returns NULL in columns from the left table if there's no match with the right table row. Zero is not automatically substituted unless handled in the query. Empty strings also do not appear unless it's a string column with a default, and duplicate key is unrelated to join behavior.

  12. Joining on Multiple Keys

    When joining sales and inventory tables on both product_id and store_id, what must you include in your ON clause?

    1. Only store_id condition
    2. Only product_id condition
    3. A GROUP BY clause
    4. Both product_id and store_id conditions

    Explanation: The ON clause should match both product_id and store_id to ensure correct rows are joined from each table. Using only one key may result in incorrect matches. GROUP BY is used after JOIN for aggregations, not for specifying join logic.

  13. Purpose of CROSS JOIN in Feature Engineering

    Which situation is a CROSS JOIN most likely to be useful for in feature engineering?

    1. Filtering null data from a table
    2. Generating all possible combinations between entities
    3. Selecting only matching records
    4. Aggregating total sales per region

    Explanation: CROSS JOIN creates all possible pairings between rows of two tables, which might be useful for feature construction in some scenarios. It does not filter nulls, aggregate values, or restrict records to only matches (as INNER JOIN does).

  14. Creating a Feature With Multiple Aggregations

    How can you create both per-user total and average purchase value in a single SQL query from a purchases table?

    1. By filtering with ORDER BY amount DESC
    2. By grouping by user_id and selecting SUM(amount) and AVG(amount)
    3. By selecting COUNT(user_id) and DISTINCT user_id
    4. By using LEFT JOIN twice on user_id

    Explanation: You can select SUM(amount) and AVG(amount) together while grouping by user_id to get total and average values per user. LEFT JOIN is for combining tables, not aggregating. COUNT and DISTINCT can count users but do not give total or average purchases. ORDER BY only sorts and doesn't aggregate.

  15. Aggregating and Sorting Results

    After calculating per-category item counts using GROUP BY category_id, which clause would you use to display the categories with the most items at the top?

    1. ORDER BY count DESC
    2. GROUP BY count
    3. WHERE count u003E 0
    4. HAVING category_id IS NOT NULL

    Explanation: ORDER BY count DESC puts the highest counts first, which is ideal for sorting categories. WHERE doesn't work after aggregation for this purpose. HAVING filters groups but doesn’t sort. GROUP BY count is not valid; you group by a column, not by an aggregated value.

  16. Preventing Duplicate Rows After Joining

    In feature engineering with joins, how can you avoid duplicate rows when joining user and event tables if users have multiple events?

    1. Use CROSS JOIN instead
    2. Add a DISTINCT clause after the JOIN
    3. Aggregate event data with GROUP BY before joining
    4. Join on non-key columns only

    Explanation: Aggregating event data with GROUP BY before joining ensures one row per user and prevents duplication in the output. CROSS JOIN would increase duplicates rather than reduce them. DISTINCT may remove duplicates, but won't address the source of redundancy if multiple event rows exist. Joining on non-key columns might create more duplicates rather than prevent them.