Test your knowledge of common SQL and Python interview questions, covering key concepts such as window functions, data cleaning, aggregation, and database manipulation. This quiz is ideal for candidates preparing for data analytics or data engineering roles where SQL and Python skills are required.
Which SQL construct allows you to select the top 2 selling products in each category of a sales table, ranked by product revenue?
Explanation: RANK() OVER (PARTITION BY category ORDER BY revenue DESC) assigns a rank within each category based on revenue, making it possible to filter for the top N records per group. 'GROUP BY category LIMIT 2' does not return two rows per group but only two groups in total. 'WHERE revenue IN (SELECT TOP 2 revenue)' incorrectly attempts to select the top two revenues globally, not per category. 'ORDER BY revenue PARTITION category' is not valid SQL syntax.
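A minimal sketch of this pattern using Python's built-in sqlite3 (requires SQLite 3.25+ for window functions); the table and column names here are illustrative, not taken from the quiz:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, category TEXT, revenue REAL);
INSERT INTO sales VALUES
  ('A', 'toys', 100), ('B', 'toys', 90), ('C', 'toys', 80),
  ('D', 'food', 70),  ('E', 'food', 60), ('F', 'food', 50);
""")

# Rank products within each category by revenue, then keep ranks 1 and 2.
top2 = conn.execute("""
SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM sales
)
WHERE rnk <= 2
ORDER BY category, revenue DESC
""").fetchall()
```

Note that the window function must live in a subquery (or CTE), because its result cannot be referenced directly in a WHERE clause.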
Which pair of pandas methods would you use to count the number of NULL and non-NULL values in each column of a DataFrame named df?
Explanation: The methods df.isnull().sum() and df.notnull().sum() return counts of null and non-null values per column, which is ideal for data quality checks. The pair dropna().count() and fillna().count() does not count nulls directly. nunique() and value_counts() are for unique values and value frequencies, not nulls. isna().mean() and notna().mean() give proportions, not counts.
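A quick demonstration on a small DataFrame (the column names are made up for the example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, "x"]})

null_counts = df.isnull().sum()       # number of NULLs per column
non_null_counts = df.notnull().sum()  # number of non-NULLs per column
```

Since isnull() returns a boolean DataFrame, sum() counts the True values column by column.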
If you need to compute a running total of revenue by date in SQL, which function should you use in your SELECT statement?
Explanation: SUM(revenue) OVER (ORDER BY date) calculates a cumulative running total ordered by date. COUNT(revenue) PARTITION BY date would count rows per date, which is different. SUM(revenue) GROUP BY date only returns totals per date, not running sums. AVG(revenue) OVER (ORDER BY date) gives running averages, not totals.
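A runnable sketch via sqlite3 (SQLite 3.25+), with an assumed daily-revenue table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (date TEXT, revenue REAL);
INSERT INTO daily VALUES ('2024-01-01', 10), ('2024-01-02', 20), ('2024-01-03', 5);
""")

# Cumulative sum: each row's total includes all earlier dates.
rows = conn.execute("""
SELECT date, SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily
ORDER BY date
""").fetchall()
# rows -> [('2024-01-01', 10.0), ('2024-01-02', 30.0), ('2024-01-03', 35.0)]
```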
To remove duplicate rows based on the 'email' column from a CSV file in pandas, which code would you use?
Explanation: The correct method is df.drop_duplicates(subset='email'), which keeps only the first occurrence of each unique email. The options 'delete_duplicates', 'remove_duplicates', and 'drop_duplication' are not valid pandas methods and will result in errors.
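For instance, on a small DataFrame (in practice the df would come from pd.read_csv):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "name":  ["Ann", "Bob", "Ann"],
})

# keep='first' is the default, so the first row per email survives
deduped = df.drop_duplicates(subset="email")
```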
Given a transaction table, how would you identify the customer with the greatest total spend using SQL?
Explanation: Aggregating with SUM(order_total), grouping by customer, and ordering by total_spent in descending order before limiting to one result gives the customer with the highest spend. Option 2 incorrectly tries to filter by the maximum value. Option 3 is incomplete and only valid in some SQL dialects. Option 4 misuses HAVING without a proper condition.
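The correct pattern, sketched with sqlite3 and assumed table/column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, order_total REAL);
INSERT INTO transactions VALUES (1, 50), (1, 70), (2, 200), (3, 30);
""")

# Aggregate per customer, sort descending, take the top row.
top = conn.execute("""
SELECT customer_id, SUM(order_total) AS total_spent
FROM transactions
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 1
""").fetchone()
```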
Which SQL query pattern would best identify dates where revenue is more than three times the average revenue of the previous 7 days?
Explanation: Comparing each day's revenue against three times the average of the previous seven days, computed with a window function and filtered in an outer query (window results cannot be referenced directly in WHERE or HAVING), flags significant surges. Option 2 misapplies GROUP BY and aggregates. Option 3 is for finding maximum revenue, not surges. Option 4 aggregates total revenue, not a moving average.
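One way to express this pattern, sketched with sqlite3 (SQLite 3.25+); the frame `ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING` averages the prior seven rows, and the comparison happens in an outer query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (date TEXT, revenue REAL);
INSERT INTO daily VALUES
  ('2024-01-01', 10), ('2024-01-02', 10), ('2024-01-03', 10),
  ('2024-01-04', 10), ('2024-01-05', 10), ('2024-01-06', 10),
  ('2024-01-07', 10), ('2024-01-08', 100);
""")

surges = conn.execute("""
SELECT date FROM (
  SELECT date, revenue,
         AVG(revenue) OVER (
           ORDER BY date
           ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS prev_avg
  FROM daily
)
WHERE prev_avg IS NOT NULL      -- first row has no prior window
  AND revenue > 3 * prev_avg
""").fetchall()
```

Here only 2024-01-08 (revenue 100 vs. a prior average of 10) qualifies as a surge.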
Which Python library and method combination allows you to connect to a PostgreSQL database and fetch the results of a query?
Explanation: psycopg2 is the standard PostgreSQL adapter for Python, using connect() for connections and fetchall() to retrieve result sets. pyodbc is used for ODBC connections and does not have a fetch_rows() method. sqlite3 is not for PostgreSQL, and Cursor.read() does not exist. mysqlclient works with MySQL, not PostgreSQL.
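The connect/cursor/execute/fetchall flow is the Python DB-API 2.0 pattern, so it can be sketched with the built-in sqlite3 module; with PostgreSQL you would swap in psycopg2.connect() with real credentials (shown as a comment, since no live server is assumed here):

```python
import sqlite3  # stand-in driver: psycopg2 exposes the same DB-API 2.0 interface

# With PostgreSQL this would be:
#   import psycopg2
#   conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")
conn = sqlite3.connect(":memory:")

cur = conn.cursor()
cur.execute("SELECT 1 + 1")
rows = cur.fetchall()  # fetchall() returns every result row as a tuple
cur.close()
conn.close()
```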
To find the maximum order value made by each user in an orders table, which SQL pattern would you use?
Explanation: Grouping by user_id and applying MAX gives the largest order per user. Option 2 only returns the global maximum. Option 3's use of MAX inside WHERE is invalid SQL syntax. Option 4 misuses HAVING, which filters groups on aggregate conditions rather than returning each user's maximum.
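A minimal sqlite3 sketch of the GROUP BY + MAX pattern (table and column names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (user_id INTEGER, order_value REAL);
INSERT INTO orders VALUES (1, 10), (1, 40), (2, 25);
""")

# One row per user, carrying that user's largest order value.
maxes = conn.execute("""
SELECT user_id, MAX(order_value) AS max_order
FROM orders
GROUP BY user_id
ORDER BY user_id
""").fetchall()
```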
After running a SQL SELECT query on a table, how do you convert the results into a pandas DataFrame for analysis?
Explanation: pd.read_sql(query, conn) reads the results of a SQL query directly into a pandas DataFrame. read_csv() expects a file or CSV text, not a database query. DataFrame(sql_results, as_frame=True) is not a valid constructor for this purpose. pd.read_query does not exist in pandas.
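End to end with an in-memory SQLite database (any DB-API connection works the same way; the users table is invented for the example):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
""")

# Query results land directly in a DataFrame, columns named after the SELECT list.
df = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
```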
Which SQL query method lets you join 'orders' and 'users' tables on user ID and return only completed orders with user names?
Explanation: The correct approach involves an explicit JOIN with ON, selecting fields from each table and filtering by status. Option 2 uses an implicit join, which is less clear and prone to errors. Options 3 and 4 are incomplete or invalid SQL statements with syntax issues.
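The explicit JOIN ... ON pattern, sketched with sqlite3 and assumed schemas:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER, name TEXT);
CREATE TABLE orders (id INTEGER, user_id INTEGER, status TEXT);
INSERT INTO users  VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 'completed'), (11, 2, 'pending'),
                          (12, 2, 'completed');
""")

# Explicit JOIN on the user ID, filtered to completed orders only.
completed = conn.execute("""
SELECT o.id, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'completed'
ORDER BY o.id
""").fetchall()
```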