Test your foundational knowledge of SQL joins, group-by aggregations, and per-entity statistics to enhance feature engineering in relational data. This quiz covers key concepts like join types, aggregate functions, and techniques for generating features such as entity counts and group-wise summaries.
Which SQL join includes all rows from the left table and only the matching rows from the right table when enriching customer records with their purchase data?
Explanation: A LEFT JOIN returns all records from the left table (including those without matches in the right table), which is useful when you want every customer even if some have no purchases. An INNER JOIN only gives records with matching rows in both tables, so customers without purchases would be excluded. RIGHT JOIN returns all records from the right table regardless of matches in the left, which would not fit this scenario. CROSS JOIN creates all possible combinations, which is generally not used for feature enrichment.
Which SQL aggregate function is best used to count the number of orders each user has placed?
Explanation: COUNT is used to count the number of non-null rows or values, making it ideal for counting orders per user. SUM adds numeric values, not the number of records. AVG calculates the average, not the count, and MIN returns the minimum value, not the total number of occurrences.
What SQL clause is used to group purchase data by user_id so you can calculate total spending per user?
Explanation: GROUP BY groups rows by specified columns such as user_id, allowing you to apply aggregate functions like SUM for feature creation. ORDER BY only sorts results and does not aggregate. WHERE filters rows before applying aggregations, and HAVING filters groups after aggregation.
What happens with an INNER JOIN between a users table and a transactions table if a user has no transactions?
Explanation: INNER JOIN only returns rows where there are matches in both tables, so users without transactions are omitted. LEFT JOIN would include the user with NULL transaction values. It will not automatically display zeros for missing transactions unless handled in the query. No error occurs; it's valid to have mismatched rows.
To calculate the average transaction amount per customer, which SQL function should be used with GROUP BY?
Explanation: AVG returns the average of numeric values, which is exactly what is needed for average transaction amount. MIN returns the lowest value, not the average. CONCAT combines text values, not appropriate here. JOIN is not a function, but a keyword for linking tables.
Which join type combines two tables by showing only rows where there are matching keys in both, such as matching students to their classes?
Explanation: INNER JOIN only returns rows with matching keys in both tables, which ensures only students enrolled in classes appear. FULL JOIN includes all rows from both tables regardless of matches. LEFT JOIN includes all from the left table. CROSS JOIN creates all possible combinations, not filtered by keys.
If you want to get both total and average sales per store in a sales table, which aggregate functions can you combine with GROUP BY?
Explanation: SUM calculates total sales, while AVG computes the average, making them suitable to combine with GROUP BY for these features. COUNT tells you the number of sales, not the totals or averages. MIN and MAX show the extremes, not the sums or averages. JOIN is not an aggregate function.
Which SQL clause is used to filter groups after aggregation, such as returning only users with more than three logins?
Explanation: HAVING lets you filter based on the result of aggregate functions, like 'COUNT(*) u003E 3'. WHERE cannot be used with aggregated results. ORDER BY sorts the results, and GROUP is not a valid clause on its own.
To count the number of unique items each user purchased, which SQL expression is appropriate?
Explanation: COUNT(DISTINCT item_id) counts the number of unique item_ids for each user. SUM(UNIQUE ...) is invalid because SUM adds values and UNIQUE is not an SQL keyword. COUNT(ALL ...) is redundant as COUNT by itself counts all non-null values, not unique ones. DISTINCT COUNT(...) is incorrect syntax.
Which query would produce a feature showing the highest score achieved by each player from a games table?
Explanation: Using MAX(score) with GROUP BY player_id delivers the highest score for each player. SUM(score) gives the total, not the maximum. The other queries group by score, which is not useful for per-player stats, and have incorrect aggregate logic.
What values will appear in the columns from the left table if there is no match in the right table when using a RIGHT JOIN?
Explanation: RIGHT JOIN returns NULL in columns from the left table if there's no match with the right table row. Zero is not automatically substituted unless handled in the query. Empty strings also do not appear unless it's a string column with a default, and duplicate key is unrelated to join behavior.
When joining sales and inventory tables on both product_id and store_id, what must you include in your ON clause?
Explanation: The ON clause should match both product_id and store_id to ensure correct rows are joined from each table. Using only one key may result in incorrect matches. GROUP BY is used after JOIN for aggregations, not for specifying join logic.
Which situation is a CROSS JOIN most likely to be useful for in feature engineering?
Explanation: CROSS JOIN creates all possible pairings between rows of two tables, which might be useful for feature construction in some scenarios. It does not filter nulls, aggregate values, or restrict records to only matches (as INNER JOIN does).
How can you create both per-user total and average purchase value in a single SQL query from a purchases table?
Explanation: You can select SUM(amount) and AVG(amount) together while grouping by user_id to get total and average values per user. LEFT JOIN is for combining tables, not aggregating. COUNT and DISTINCT can count users but do not give total or average purchases. ORDER BY only sorts and doesn't aggregate.
After calculating per-category item counts using GROUP BY category_id, which clause would you use to display the categories with the most items at the top?
Explanation: ORDER BY count DESC puts the highest counts first, which is ideal for sorting categories. WHERE doesn't work after aggregation for this purpose. HAVING filters groups but doesn’t sort. GROUP BY count is not valid; you group by a column, not by an aggregated value.
In feature engineering with joins, how can you avoid duplicate rows when joining user and event tables if users have multiple events?
Explanation: Aggregating event data with GROUP BY before joining ensures one row per user and prevents duplication in the output. CROSS JOIN would increase duplicates rather than reduce them. DISTINCT may remove duplicates, but won't address the source of redundancy if multiple event rows exist. Joining on non-key columns might create more duplicates rather than prevent them.