Assess your understanding of essential data cleaning and transformation techniques using dplyr, including filtering, selecting, mutating, summarizing, and handling missing values. Strengthen your data manipulation skills with practical questions designed for easy comprehension.
Which dplyr function allows you to select only rows where the column 'age' is greater than 30?
Explanation: The 'filter' function is specifically used to select rows based on given conditions, such as 'age' greater than 30. 'fillter' is a misspelling and is not a valid function in dplyr. The 'select' function changes which columns are included, not which rows. The 'arrange' function orders the rows but does not filter them out.
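As a minimal sketch of the row filtering described above, the snippet below assumes a small made-up data frame called people; the data and variable names are hypothetical and serve only as illustration.

```r
library(dplyr)

# Hypothetical example data (not from the quiz).
people <- data.frame(
  name = c("Ana", "Ben", "Cara", "Dev"),
  age  = c(25, 31, 42, 30)
)

# Keep only the rows where age is strictly greater than 30.
over_30 <- people %>% filter(age > 30)
```

The row with age equal to 30 is dropped because the condition is a strict inequality.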
If you want to keep only the columns 'name' and 'score' from a data frame, which dplyr function is appropriate?
Explanation: The correct function is 'select', which returns only the specified columns from the data. 'sort' is used for ordering data, not choosing columns. 'subset' can filter both rows and columns but is not a dplyr verb. 'slect' is a typo and does not exist in dplyr.
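A short sketch of column selection, assuming a hypothetical students data frame with one extra column that should be dropped:

```r
library(dplyr)

# Hypothetical example data with three columns.
students <- data.frame(
  name  = c("Ana", "Ben"),
  score = c(88, 92),
  class = c("A", "B")
)

# Keep only the name and score columns; all rows are retained.
name_score <- students %>% select(name, score)
```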
To create a new column called 'total' as the sum of 'math' and 'english' columns, which dplyr function should you use?
Explanation: 'mutate' is the dplyr function for creating new columns or modifying existing ones using calculations or expressions. 'summarize' reduces many values down to one summary per group, not per row, so it is not suitable here. 'update' is not a dplyr function. 'mutat' is a common typographical error.
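The column creation described above might look like the following sketch, which assumes a hypothetical grades data frame holding the 'math' and 'english' columns:

```r
library(dplyr)

# Hypothetical example data.
grades <- data.frame(
  math    = c(70, 85),
  english = c(80, 90)
)

# Add a total column computed row by row from the two existing columns.
with_total <- grades %>% mutate(total = math + english)
```

The result has the same number of rows as the input, which is the key difference from 'summarize'.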
How can you reorder the rows of a data frame by the column 'salary' in descending order?
Explanation: The 'arrange' function sorts rows, and 'desc' is used to specify descending order. 'descend' is not a valid function in dplyr. 'filter' selects rows based on logical conditions but does not sort them. 'arrnage' is a typo of 'arrange'.
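A minimal sketch of descending ordering, assuming a hypothetical staff data frame with a 'salary' column:

```r
library(dplyr)

# Hypothetical example data.
staff <- data.frame(
  name   = c("Ana", "Ben", "Cara"),
  salary = c(52000, 71000, 64000)
)

# Sort rows from highest to lowest salary; desc() reverses the default ascending order.
by_salary <- staff %>% arrange(desc(salary))
```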
If you need to calculate the mean 'score' for each 'class', which dplyr function helps along with group_by?
Explanation: 'summarize' works with 'group_by' to create summary statistics, such as the mean score per class, returning one row per group. 'mutate' adds or modifies columns while keeping every row, so it does not collapse each group to a single value. 'compact' is not a dplyr function. 'summerize' is a frequent misspelling of 'summarize'.
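To illustrate the grouped summary, here is a sketch that assumes a hypothetical scores data frame with 'class' and 'score' columns; the output column name mean_score is likewise an arbitrary choice.

```r
library(dplyr)

# Hypothetical example data: two classes with two scores each.
scores <- data.frame(
  class = c("A", "A", "B", "B"),
  score = c(80, 90, 70, 75)
)

# Collapse each class to a single row holding its mean score.
class_means <- scores %>%
  group_by(class) %>%
  summarize(mean_score = mean(score))
```

With this data the result has two rows, one per class, with means of 85 and 72.5.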
Which function in dplyr helps remove duplicate rows from a data frame?
Explanation: The correct answer is 'distinct', which returns only unique rows from the data. 'unique_rows' and 'duplicates' are not dplyr functions. 'distict' is a typographical error and is not recognized by dplyr.
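A brief sketch of duplicate removal, assuming a hypothetical records data frame in which one row appears twice:

```r
library(dplyr)

# Hypothetical example data; the second and third rows are identical.
records <- data.frame(
  id   = c(1, 2, 2, 3),
  city = c("Oslo", "Lima", "Lima", "Pune")
)

# Keep one copy of each unique row.
unique_records <- records %>% distinct()
```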
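When cleaning data, which function, commonly used alongside dplyr, allows you to remove all rows with missing values in any column?
Explanation: 'drop_na' removes every row that contains a missing value in any column, making it the correct function for cleaning data in this way. Note that 'drop_na' is provided by the tidyr package rather than dplyr itself, although the two are routinely loaded together as part of the tidyverse. 'remove_na' and 'omit_na' sound similar but are not actual dplyr or tidyr functions. 'keep_na' would imply retaining missing values, which is the opposite of what is needed.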
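The sketch below shows 'drop_na' on a hypothetical survey data frame. Because 'drop_na' is exported by the tidyr package, the example loads tidyr in addition to dplyr.

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# Hypothetical example data with missing values in two different columns.
survey <- data.frame(
  id    = 1:4,
  age   = c(34, NA, 29, 41),
  score = c(7, 8, NA, 9)
)

# Drop every row that has an NA in any column.
complete_rows <- survey %>% drop_na()
```

Passing column names, for example drop_na(age), restricts the check to those columns only.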
Which dplyr function stacks two data frames on top of each other, combining their rows?
Explanation: 'bind_rows' is used for row-wise binding, combining multiple data frames by stacking their rows. 'merge' is a base R function for joining data frames by common columns, not for simply stacking them. 'combine' and 'join_rows' are not dplyr functions for this purpose.
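A minimal sketch of stacking, assuming two hypothetical data frames (q1 and q2) that share the same columns:

```r
library(dplyr)

# Hypothetical example data: two tables with identical column names.
q1 <- data.frame(id = 1:2, sales = c(100, 150))
q2 <- data.frame(id = 3:4, sales = c(120, 90))

# Stack q2 underneath q1; columns are matched by name.
all_sales <- bind_rows(q1, q2)
```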
To rename the column 'height' to 'tallness' in a data frame, which dplyr function should be used?
Explanation: 'rename' safely changes the name of existing columns. 'changenames' and 'rename_col' might look appropriate but are not valid dplyr functions. 'renmae' is a common typographical error and would result in an error.
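A short sketch of renaming, assuming a hypothetical people data frame with a 'height' column. Note that 'rename' takes the new name on the left and the existing name on the right.

```r
library(dplyr)

# Hypothetical example data.
people <- data.frame(
  name   = c("Ana", "Ben"),
  height = c(165, 180)
)

# Syntax is new_name = old_name; other columns are left untouched.
renamed <- people %>% rename(tallness = height)
```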
Which dplyr function should you use to select a random sample of 10 rows from a data frame?
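Explanation: 'sample_n' is designed to randomly select a specified number of rows from a data set. In dplyr 1.0.0 and later it is superseded by 'slice_sample', called as slice_sample(n = 10), although 'sample_n' still works. 'random_pick' and 'select_n' are not dplyr functions and will not work for this purpose. 'sampel_n' is a misspelling that will result in an error.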
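As a sketch of random sampling, the snippet below assumes a hypothetical 100-row data frame called pool. It also shows 'slice_sample', which supersedes 'sample_n' in dplyr 1.0.0 and later; the seed value is arbitrary and only makes the draw reproducible.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the sample is reproducible

# Hypothetical example data: 100 rows identified by id.
pool <- data.frame(id = 1:100)

# Draw 10 rows at random, without replacement by default.
picked <- pool %>% sample_n(10)

# Equivalent call with the newer verb (assumes dplyr >= 1.0.0).
picked_new <- pool %>% slice_sample(n = 10)
```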