Challenge your understanding of advanced R concepts and real-world data science scenarios with this quiz focused on practical analysis, modeling, and best practices. Prepare for interview success by mastering key R functions, data manipulation strategies, and statistical approaches relevant to modern data science roles.
Which function in R is typically used to merge two data frames by common columns, such as joining sales and customer data by customer_id?
Explanation: The merge function in R is specifically designed to combine data frames by matching values in one or more key columns, similar to SQL joins. concat is not a standard function in R for joining data frames; it might be confused with concatenation functions. group_by is used to group data for aggregation, not for joining. stack reshapes data from wide to long format, but does not perform joins.
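As a minimal sketch of the join described above, the example below merges two small data frames on customer_id. The data frames, column names, and values are made up for illustration.

```r
# Hypothetical data; names and values are illustrative only.
sales <- data.frame(
  customer_id = c(1, 2, 2, 4),
  amount = c(50, 20, 35, 10)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  name = c("Ana", "Ben", "Cy")
)

# Inner join: keeps only customer_id values present in both data frames.
inner <- merge(sales, customers, by = "customer_id")

# Left join: keeps every sales row; unmatched customers get NA for name.
left <- merge(sales, customers, by = "customer_id", all.x = TRUE)
```

Setting all.x, all.y, or all to TRUE gives the equivalent of a left, right, or full outer join in SQL.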
Which function would you use to count all missing values in a data frame when analyzing a medical dataset for completeness in R?
Explanation: sum(is.na(data)) returns the total number of missing values (NAs) in a data frame, which is essential in data cleaning. mean(data) would compute averages, not count missing values. unique(data) gives unique elements, not missing value counts. quantile(data) returns distribution quantiles and does not identify missing values.
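A short sketch of the missing-value count, assuming a small made-up patients data frame. colSums is added here to show a per-column breakdown, which is often the next step in a completeness check.

```r
# Hypothetical medical data with one NA in each column.
patients <- data.frame(
  age = c(34, NA, 51),
  bp = c(120, 135, NA),
  sex = c("F", NA, "M")
)

# is.na() returns a logical matrix; sum() counts the TRUE entries.
total_missing <- sum(is.na(patients))

# Per-column counts of missing values.
missing_by_column <- colSums(is.na(patients))

# Rows with no missing values at all.
complete_rows <- patients[complete.cases(patients), ]
```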
Which R function is most commonly used to train a linear regression model to predict house prices based on multiple features like area and number of rooms?
Explanation: The lm function in R stands for 'linear model' and fits linear regression models, commonly used for predictive modeling. prcomp is for principal component analysis, not regression. hist creates histograms for data visualization, and svd performs singular value decomposition for matrix operations, not for predicting variables.
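The following sketch fits a linear model with lm on synthetic data. The houses data frame is invented, and price is built as an exact linear function of area and rooms (no noise) so the fitted coefficients are easy to check; real data would not fit this cleanly.

```r
# Synthetic, illustrative data: price is an exact linear function of the
# two features, so the fit should recover the coefficients used here.
houses <- data.frame(
  area = c(50, 80, 120, 65, 100, 90),
  rooms = c(2, 3, 4, 3, 3, 4)
)
houses$price <- 20000 + 1000 * houses$area + 5000 * houses$rooms

# Formula interface: response on the left, predictors on the right.
fit <- lm(price ~ area + rooms, data = houses)

coefficients <- coef(fit)

# Predict the price of a new, hypothetical house.
new_house <- data.frame(area = 70, rooms = 3)
predicted <- predict(fit, newdata = new_house)
```

With noisy data, summary(fit) reports standard errors, p-values, and R-squared for the same model object.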
If tasked with displaying the distribution of a categorical variable such as preferred product type in a customer dataset, which R base plotting function is most suitable?
Explanation: barplot is the recommended base R function for visualizing the frequency distribution of categorical variables. plot is more generic and may not default to bars for categorical data. line is incorrect because in base R it fits a robust line to paired numeric data rather than drawing a chart, and line charts suit continuous variables in any case. piechart is not a base R function; the correct function is pie.
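A minimal sketch of the usual pattern: tabulate the categorical variable with table, then pass the counts to barplot. The product_type vector and the plot labels are illustrative assumptions.

```r
# Hypothetical preferred product types for six customers.
product_type <- c("book", "toy", "book", "game", "toy", "book")

# table() counts how often each category occurs (sorted alphabetically).
counts <- table(product_type)

# barplot() draws one bar per category and returns the bar midpoints.
midpoints <- barplot(
  counts,
  main = "Preferred product type",
  xlab = "Product type",
  ylab = "Number of customers"
)
```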
Which method in R can help identify and remove highly correlated variables before modeling to avoid multicollinearity problems?
Explanation: The cor function computes pairwise correlations between variables, helping to reveal multicollinearity that may affect modeling. apply is for applying functions over data frames or matrices, not specifically correlations. rbind concatenates data frames row-wise. scan is for reading data into R, unrelated to feature selection.
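cor only reports the correlations; removing variables takes an extra step. The sketch below shows one possible base R approach, assuming a made-up features data frame and an arbitrary cutoff of 0.9 on the absolute correlation. Other cutoffs and selection rules are equally valid.

```r
# Illustrative data: x2 is nearly a multiple of x1, x3 is unrelated.
x1 <- 1:10
features <- data.frame(
  x1 = x1,
  x2 = 2 * x1 + rep(c(0.1, -0.1), 5),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

# Absolute pairwise correlations.
cor_matrix <- abs(cor(features))

# Keep only the upper triangle so each pair is considered once
# and the diagonal (always 1) is ignored.
cor_matrix[!upper.tri(cor_matrix)] <- 0

# Assumed cutoff: drop the later variable of any pair above 0.9.
cutoff <- 0.9
to_drop <- colnames(cor_matrix)[apply(cor_matrix, 2, max) > cutoff]

reduced <- features[, !(names(features) %in% to_drop), drop = FALSE]
```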
Which R function is most suitable for removing specific characters or patterns from text fields in a dataset, such as eliminating punctuation from user comments?
Explanation: gsub is used to replace or remove specified patterns (like punctuation) in character vectors, making it valuable for text cleaning tasks. matrix creates matrices and is unrelated to string manipulation. tapply applies a function over subsets of a vector, but does not directly process strings. factor converts variables to factor type, not for cleaning text.
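As a brief sketch, the example below strips punctuation from a few invented comments using the POSIX character class [[:punct:]]. The trimws call is an extra step added here to remove whitespace left behind at the ends.

```r
# Hypothetical user comments.
comments <- c("Great product!!!", "Not bad... could be better?", "Love it :)")

# Replace every punctuation character with an empty string.
no_punct <- gsub("[[:punct:]]", "", comments)

# Trim leading and trailing whitespace left after removal.
cleaned <- trimws(no_punct)
```

gsub replaces every match in each string, whereas sub replaces only the first one.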
When analyzing workflow logs and needing to perform successive data transformations, which operator makes code more readable by chaining operations in R?
Explanation: The %>% operator, known as the pipe and provided by the magrittr package (dplyr re-exports it), enables chaining of multiple data transformation steps for increased readability. %% is the modulo operator for remainders. ^ is used for exponentiation. + is the addition operator. None of these three facilitates readable multistep data processing like the pipe.
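To show the readability gain without requiring any packages, the sketch below uses base R's native pipe |> (available from R 4.1), which is a swapped-in stand-in for the magrittr pipe; for simple chains like this one, %>% is written the same way once magrittr or dplyr is loaded. The logs data frame is invented for illustration.

```r
# Hypothetical workflow log; column names are illustrative only.
logs <- data.frame(
  step = c("load", "clean", "load", "model", "clean"),
  seconds = c(12, 30, 15, 120, 28)
)

# Nested calls: must be read from the inside out.
nested <- head(transform(subset(logs, seconds > 14), minutes = seconds / 60), 2)

# Piped calls: each step reads top to bottom in execution order.
piped <- logs |>
  subset(seconds > 14) |>
  transform(minutes = seconds / 60) |>
  head(2)
```

Both forms produce the same result; the pipe passes the left-hand value as the first argument of the next call.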
Which simple R function can visually identify outliers in a numeric variable, such as monthly spending, with a graphical summary?
Explanation: boxplot in R displays the distribution of a numeric variable, highlighting outliers as points outside the whiskers. cbind combines objects column-wise but does not plot data. table provides value counts for categorical data, not visualization. mean summarizes the central tendency but does not reveal outliers visually.
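A small sketch with invented monthly spending figures. boxplot draws the chart, and boxplot.stats exposes the same outlier rule (points beyond 1.5 times the interquartile range from the hinges) as values you can inspect.

```r
# Hypothetical monthly spending; one value is far above the rest.
spending <- c(210, 190, 205, 220, 198, 215, 202, 950)

# Draw the box plot; outliers appear as individual points.
boxplot(spending, main = "Monthly spending", ylab = "Amount")

# The same outliers as numbers, without reading them off the plot.
outliers <- boxplot.stats(spending)$out
```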
To split a labeled dataset into training and testing sets for model evaluation, which function or method is commonly used in base R?
Explanation: The sample function is used to randomly partition data, such as creating indices for training and test subsets. aggregate computes summaries for groups, not data splitting. replicate repeats operations multiple times but is unrelated to partitioning. arrange is not a base R function and is commonly related to ordering, not splitting.
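One common way to use sample for a split is to draw random row indices for the training set and take the remaining rows for testing. The sketch below assumes an 80/20 split and uses the built-in iris dataset as a stand-in for a labeled dataset; both choices are illustrative.

```r
# Fix the random seed so the split is reproducible.
set.seed(42)

n <- nrow(iris)

# Draw 80% of the row indices without replacement.
train_idx <- sample(n, size = floor(0.8 * n))

train <- iris[train_idx, ]
test <- iris[-train_idx, ]   # negative indexing keeps the remaining rows
```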
When preparing data for modeling, which R function allows you to check or set the distinct levels of a categorical variable stored as a factor?
Explanation: levels retrieves or assigns the unique possible values (levels) in a factor variable, which is important in encoding categorical data for analysis. filter in base R (stats package) applies linear filtering to time series; the row-subsetting filter comes from the dplyr package, and neither inspects factor levels. matrix is for creating matrix objects. sorts is a misspelling; the actual function is sort.
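The sketch below checks and sets factor levels on a made-up size variable. Note the difference between reordering levels (done with factor and its levels argument) and renaming them (done by assigning to levels).

```r
# Hypothetical categorical variable.
size <- factor(c("small", "large", "medium", "small"))

# Check the levels; by default they are sorted alphabetically.
default_levels <- levels(size)

# Reorder the levels without changing the data values.
size <- factor(size, levels = c("small", "medium", "large"))

# Rename one level by assignment; the values labelled "medium" become "mid".
levels(size)[levels(size) == "medium"] <- "mid"
```

Assigning a whole new vector to levels(size) relabels the existing levels in place rather than reordering them, so it should be used with care.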